Learn how to measure the quality of your AI capabilities by running evaluations against ground truth data.
The evaluation framework described here is in active development. Axiom is working with design partners to shape what’s built. Contact Axiom to get early access and join a small group of teams shaping these tools.
The Measure stage is where you quantify the quality and effectiveness of your AI capability. Instead of relying on anecdotal checks, this stage uses a systematic process called an eval to score your capability’s performance against a known set of correct examples (ground truth). This provides a data-driven benchmark to ensure a capability is ready for production and to track its quality over time.
The Eval function
The core of this stage is the Eval function, which will be available in the @axiomhq/ai/evals package. It provides a simple, declarative way to define a test suite for your capability directly in your codebase.
An Eval is structured around a few key parameters:
- data: An async function that returns your collection of { input, expected } pairs, which serve as your ground truth.
- task: The function that executes your AI capability, taking an input and producing an output.
- scorers: An array of grader functions that score the output against the expected value.
- threshold: A score between 0 and 1 that determines the pass/fail condition for the evaluation.
Here is an example of a complete evaluation suite:
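The sketch below assumes the parameter shape described above. Because the evals package is still in development, the import path, option names, and scorer signature are assumptions, and answerSupportQuestion stands in for a capability of your own.

```typescript
// Minimal sketch of an Eval suite. The evals package is still in development,
// so the import path, option names, and scorer signature shown here are
// assumptions based on the parameters described above, not the final API.
import { Eval } from '@axiomhq/ai/evals';

// Hypothetical capability under test: answers a support question.
import { answerSupportQuestion } from '../src/capabilities/support';

// A hand-written scorer: exact match between output and expected value.
const exactMatch = ({ output, expected }: { output: string; expected: string }) => ({
  name: 'exact-match',
  score: output.trim() === expected.trim() ? 1 : 0,
});

Eval('support-answer-quality', {
  // Ground truth: { input, expected } pairs returned by an async function.
  data: async () => [
    { input: 'How do I reset my password?', expected: 'Use the "Forgot password" link on the sign-in page.' },
    { input: 'Which plans include SSO?', expected: 'SSO is included in the Team and Enterprise plans.' },
  ],
  // The task runs the capability for each input and returns its output.
  task: async (input: string) => answerSupportQuestion(input),
  // Scorers grade each output against the expected value.
  scorers: [exactMatch],
  // The suite passes only if the aggregate score meets this threshold.
  threshold: 0.8,
});
```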
Grading with scorers
Scorers are the functions that grade your capability’s output. Each scorer receives the input, the generated output, and the expected value, and must return a score.
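As a rough illustration, a custom scorer might look like the following. The { input, output, expected } argument shape and the returned { name, score } object are assumptions based on the description above, and keyword overlap is just one possible grading strategy.

```typescript
// Sketch of a custom scorer. The argument and return shapes are assumed
// from the description above; the released package may differ.
type ScorerArgs = { input: string; output: string; expected: string };

// Grades partial credit: the fraction of expected keywords present in the output.
function keywordOverlap({ output, expected }: ScorerArgs) {
  const keywords = expected.toLowerCase().split(/\s+/).filter((w) => w.length > 3);
  if (keywords.length === 0) return { name: 'keyword-overlap', score: 1 };
  const hits = keywords.filter((w) => output.toLowerCase().includes(w)).length;
  return { name: 'keyword-overlap', score: hits / keywords.length };
}
```

A scorer like this can then be passed in the scorers array of an Eval alongside any other graders.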
Running evaluations
Evaluation suites will run through the axiom CLI, which executes the specified test file using vitest in the background. Note that vitest will be a peer dependency for this functionality.
Analyzing results in the console
Evaluation results are sent to Axiom enriched with eval.* attributes, allowing you to deeply analyze them in the Axiom Console.
The Console will feature leaderboards and comparison views to track score progression across different versions of a capability, helping you verify that your changes are leading to measurable improvements.
What’s next?
Once your capability meets your quality benchmarks in the Measure stage, it’s ready to be deployed. The next step is to monitor its performance with real-world traffic.
Learn more about this step of the Rudder workflow in the Observe docs.