When you run `axiom eval` from the CLI, offline evaluation results appear in the Console, and the CLI provides a link to view them. Use these results to answer questions like:
- How well does this configuration perform?
- How does it compare to previous versions?
- Which tradeoffs are acceptable?
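As a minimal sketch, the run that produces these results looks like the following (exact arguments depend on how your evals are defined in your project):

```bash
# Run offline evaluations; the CLI prints a link to the results in the Console
axiom eval
```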
## Compare configurations
To understand the impact of changes, compare evaluation runs to see deltas in accuracy, latency, and cost.

### Using the Console
Run your evaluation before and after making changes, then compare both runs in the Axiom Console.

### Using the baseline flag
For direct CLI comparison, specify a baseline evaluation ID.

The `--baseline` flag expects a trace ID. After running an evaluation, copy the trace ID from the CLI output or Console URL to use as a baseline for comparison.

For example, switching from gpt-4o-mini to gpt-4o might show:
- Accuracy: 85% → 95% (+10%)
- Latency: 800 ms → 1.6 s (+100%)
- Cost per run: $0.002 → $0.020 (+900%)
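A comparison like the one above uses the baseline flow described earlier. A sketch, with a placeholder trace ID:

```bash
# Re-run the evaluation, comparing results against a previous run
axiom eval --baseline <trace-id>
```

Replace `<trace-id>` with the trace ID copied from the CLI output or Console URL of the earlier run.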
## Investigate failures
When test cases fail, click into them to see:
- The exact input that triggered the failure
- The output your capability produced versus what was expected
- The full trace of LLM calls and tool executions
As you review, look for patterns:
- Do failures cluster around specific input types?
- Are certain scorers failing consistently?
- Is high token usage correlated with failures?
## Experiment with flags
Flags let you test multiple configurations systematically by running several experiments.

## Track progress over time
For teams running evaluations regularly (nightly or in CI), the Console shows whether your capability is improving or regressing across iterations. Compare your latest run against your initial baseline to verify that accumulated changes are moving in the right direction.

## What’s next?
- To learn how to use flags for experimentation, see Flags and experiments.
- To iterate on your capability based on evaluation results, see Iterate.