Prerequisites
Follow the Quickstart:
- To run evals within the context of an existing AI app, follow the instrumentation setup in the Quickstart.
- To run evals without an existing AI app, skip the part in the Quickstart about instrumenting your app.
Write evaluation function
The `Eval` function provides a simple, declarative way to define a test suite for your capability directly in your codebase.
The key parameters of the `Eval` function:
- `data`: An async function that returns your collection of `{ input, expected }` pairs, which serve as your ground truth.
- `task`: The function that executes your AI capability, taking an `input` and producing an `output`.
- `scorers`: An array of scorer functions that score the `output` against the `expected` value.
- `metadata`: Optional metadata for the evaluation, such as a description.
Define your evaluation in `/src/evals/ticket-classification.eval.ts`.
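The following is a minimal sketch of that file. The import path `axiom/ai/evals`, the `task` signature, the `{ input, output, expected }` scorer arguments, and the `flag` accessor are assumptions to verify against the Axiom AI SDK reference; the model call uses the Vercel AI SDK's `generateText`.

```typescript
// /src/evals/ticket-classification.eval.ts
import { Eval } from 'axiom/ai/evals'; // assumed import path
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { flag } from '../lib/app-scope'; // flag accessor defined in the next section

const CATEGORIES = ['billing', 'bug', 'feature-request', 'spam'] as const;

// Scorer 1: exact match against the ground-truth label.
const exactMatch = ({ output, expected }: { output: string; expected: string }) => ({
  score: output === expected ? 1 : 0,
});

// Scorer 2: checks that the model produced one of the allowed categories.
const validCategory = ({ output }: { output: string }) => ({
  score: (CATEGORIES as readonly string[]).includes(output) ? 1 : 0,
});

Eval('ticket-classification', {
  // Ground truth: { input, expected } pairs.
  data: async () => [
    { input: 'I was charged twice this month', expected: 'billing' },
    { input: 'The export button crashes the app', expected: 'bug' },
    { input: 'CHEAP WATCHES!!! CLICK HERE', expected: 'spam' },
  ],
  // Task: run the capability under test, parameterized by the ticketClassification flag.
  task: async (input: string) => {
    const { model } = flag('ticketClassification'); // assumed accessor shape
    const { text } = await generateText({
      model: openai(model),
      prompt: `Classify this support ticket as one of: ${CATEGORIES.join(', ')}.\nTicket: ${input}\nCategory:`,
    });
    return text.trim().toLowerCase();
  },
  scorers: [exactMatch, validCategory],
  metadata: { description: 'Classifies support tickets into categories' },
});
```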
Set up flags
Create the file `src/lib/app-scope.ts`.
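The sketch below assumes the SDK exposes a `createAppScope` helper under `axiom/ai` that takes a Zod flag schema and returns a typed `flag` accessor; confirm the exact export name and return shape against the Axiom AI SDK reference.

```typescript
// /src/lib/app-scope.ts
import { createAppScope } from 'axiom/ai'; // assumed export and import path
import { z } from 'zod';

// Each flag gets a Zod schema and a default value. Defaults can be overridden
// at runtime, for example with the flags-config file shown later on this page.
export const { flag } = createAppScope({
  flagSchema: z.object({
    ticketClassification: z.object({
      model: z.string().default('gpt-4o-mini'),
    }),
  }),
});
```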
Run evaluations
To run your evaluation suites from your terminal, install the Axiom CLI and use the following commands.

| Description | Command |
|---|---|
| Run all evals | `axiom eval` |
| Run specific eval file | `axiom eval src/evals/ticket-classification.eval.ts` |
| Run evals matching a glob pattern | `axiom eval "**/*spam*.eval.ts"` |
| Run eval by name | `axiom eval "spam-classification"` |
| List available evals without running | `axiom eval --list` |
Analyze results in Console
When you run an eval, the Axiom AI SDK captures a detailed OpenTelemetry trace for the entire run. This includes parent spans for the evaluation suite and child spans for each individual test case, task execution, and scorer result. Axiom enriches the traces with `eval.*` attributes, allowing you to deeply analyze results in the Axiom Console.
The results of an eval include:
- Pass/fail status for each test case
- Scores from each scorer
- Comparison to baseline (if available)
- Links to view detailed traces in Axiom
Additional configuration options
Custom scorers
A scorer is a function that scores a capability’s output. Scorers receive the `input`, the generated `output`, and the `expected` value, and return a score.
The example above uses two custom scorers. Scorers can return metadata alongside the score.
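Building on the sketch above, the `validCategory` scorer could attach metadata like this; the `{ score, metadata }` return shape is an assumption based on the description above.

```typescript
// Hypothetical scorer returning metadata alongside the score.
const validCategory = ({ output }: { output: string }) => {
  const categories = ['billing', 'bug', 'feature-request', 'spam'];
  const normalized = output.trim().toLowerCase();
  return {
    score: categories.includes(normalized) ? 1 : 0,
    metadata: { allowedCategories: categories, normalizedOutput: normalized },
  };
};
```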
You can use the autoevals library instead of custom scorers. autoevals provides prebuilt scorers for common tasks like semantic similarity, factual correctness, and text matching.
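For example, the `Levenshtein` scorer from autoevals compares the output string against the expected value. autoevals scorers are plain async functions that take `{ output, expected }` and resolve to a score object, so passing them in the same `scorers` array as the custom scorers above should work, though that is an assumption to verify.

```typescript
import { Levenshtein } from 'autoevals';

// Standalone usage: normalized string similarity in [0, 1].
const result = await Levenshtein({
  output: 'feature-request',
  expected: 'feature request',
});
console.log(result.score);
```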
Run experiments
Flags let you parameterize your AI behavior (like model choice or prompting strategies) and run experiments with different configurations. They’re type-safe via Zod schemas. The example above uses the `ticketClassification` flag to test different language models. Flags have a default value that you can override at runtime in one of the following ways:
- Override flags directly when you run the eval (see the first command in the sketch below).
- Alternatively, specify the flag overrides in a JSON file such as `experiment.json`, and then pass that file as the value of the `flags-config` parameter when you run the eval (see the second command in the sketch below).
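A sketch of both approaches follows. The `flags-config` name comes from the parameter mentioned above; the inline `--flag` option and the exact shape of `experiment.json` are assumptions, so check `axiom eval --help` before relying on them.

```bash
# 1. Hypothetical inline override (the --flag option name is an assumption):
axiom eval src/evals/ticket-classification.eval.ts --flag ticketClassification.model=gpt-4o

# 2. Override via a JSON file passed with the flags-config parameter:
axiom eval src/evals/ticket-classification.eval.ts --flags-config experiment.json
```

where `experiment.json` mirrors the flag schema, for example:

```json
{
  "ticketClassification": {
    "model": "gpt-4o"
  }
}
```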
What’s next?
A capability is ready to be deployed when it meets your quality benchmarks. After deployment, the next steps can include the following:
- Baseline comparisons: Run evals multiple times to track regressions over time.
- Experiment with flags: Test different models or strategies using flag overrides.
- Advanced scorers: Build custom scorers for domain-specific metrics.
- CI/CD integration: Add `axiom eval` to your CI pipeline to catch regressions.