The evaluation framework described here is in active development. Axiom is working with design partners to shape what’s built. Contact Axiom to get early access and join a focused group of teams shaping these tools.
The Measure stage is where you quantify the quality and effectiveness of your AI capabilities. Instead of relying on anecdotal checks, this stage uses a systematic process called an evaluation (eval) to score your capability’s performance against a known set of correct examples (ground truth). This provides a data-driven benchmark to ensure a capability is ready for production and to track its quality over time.

Evaluations (evals) are systematic tests that measure how well your AI features perform. Instead of manually testing AI outputs, evals automatically run your AI code against test datasets and score the results using custom metrics. This lets you catch regressions, compare different approaches, and confidently improve your AI features over time.

Prerequisites

Follow the Quickstart:
  • To run evals within the context of an existing AI app, follow the instrumentation setup in the Quickstart.
  • To run evals without an existing AI app, skip the part of the Quickstart about instrumenting your app.

Write an evaluation function

The Eval function provides a simple, declarative way to define a test suite for your capability directly in your codebase. Its key parameters are:
  • data: An async function that returns your collection of { input, expected } pairs, which serve as your ground truth.
  • task: The function that executes your AI capability, taking an input and producing an output.
  • scorers: An array of scorer functions that score the output against the expected value.
  • metadata: Optional metadata for the evaluation, such as a description.
The example below creates an evaluation for a support ticket classification system in the file /src/evals/ticket-classification.eval.ts.
/src/evals/ticket-classification.eval.ts
import { Eval, Scorer } from 'axiom/ai/evals';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { wrapAISDKModel } from 'axiom/ai';
import { flag, pickFlags } from '../lib/app-scope';
import { z } from 'zod';

// The function you want to evaluate
async function classifyTicket({ subject, content }: { subject?: string; content: string }) {
  const model = flag('ticketClassification.model');
  
  const result = await generateObject({
    model: wrapAISDKModel(openai(model)),
    messages: [
      {
        role: 'system',
        content: `You are a customer support engineer classifying tickets as: spam, question, feature_request, or bug_report.
        If spam, return a polite auto-close message. Otherwise, say a team member will respond shortly.`,
      },
      {
        role: 'user',
        content: subject ? `Subject: ${subject}\n\n${content}` : content,
      },
    ],
    schema: z.object({
      category: z.enum(['spam', 'question', 'feature_request', 'bug_report']),
      response: z.string()
    }),
  });

  return result.object;
}

// Custom exact-match scorer that returns score and metadata
const ExactMatchScorer = Scorer(
  'Exact-Match',
  ({ output, expected }: { output: { response: string }; expected: { response: string } }) => {
    const normalizedOutput = output.response.trim().toLowerCase();
    const normalizedExpected = expected.response.trim().toLowerCase();

    return {
      score: normalizedOutput === normalizedExpected,
      metadata: {
        details: 'A scorer that checks for exact match',
      },
    };
  }
);

// Custom spam classification scorer
const SpamClassificationScorer = Scorer(
  "Spam-Classification",
  ({ output, expected }: {
    output: { category: string };
    expected: { category: string };
  }) => {
    const isSpam = (item: { category: string }) => item.category === 'spam';
    return isSpam(output) === isSpam(expected) ? 1 : 0;
  }
);

// Define the evaluation
Eval('spam-classification', {
  // Specify which flags this eval uses
  configFlags: pickFlags('ticketClassification'),
  
  // Test data with input/expected pairs
  data: [
    {
      input: {
        subject: "Congratulations! You've Been Selected for an Exclusive Reward",
        content: 'Claim your $500 gift card now by clicking this link!',
      },
      expected: {
        category: 'spam',
        response: "We're sorry, but your message has been automatically closed.",
      },
    },
    {
      input: {
        subject: 'FREE CA$H',
        content: 'BUY NOW ON WWW.BEST-DEALS.COM!',
      },
      expected: {
        category: 'spam',
        response: "We're sorry, but your message has been automatically closed.",
      },
    },
  ],
  
  // The task to run for each test case
  task: async ({ input }) => {
    return await classifyTicket(input);
  },
  
  // Scorers to measure performance
  scorers: [SpamClassificationScorer, ExactMatchScorer],
  
  // Optional metadata
  metadata: {
    description: 'Classify support tickets as spam or not spam',
  },
});

Set up flags

Create the file src/lib/app-scope.ts:
/src/lib/app-scope.ts
import { createAppScope } from 'axiom/ai';
import { z } from 'zod';

export const flagSchema = z.object({
  ticketClassification: z.object({
    model: z.string().default('gpt-4o-mini'),
  }),
});

const { flag, pickFlags } = createAppScope({ flagSchema });

export { flag, pickFlags };

Run evaluations

To run your evaluation suites from your terminal, install the Axiom CLI and use the following commands.
  • Run all evals: axiom eval
  • Run a specific eval file: axiom eval src/evals/ticket-classification.eval.ts
  • Run evals matching a glob pattern: axiom eval "**/*spam*.eval.ts"
  • Run an eval by name: axiom eval "spam-classification"
  • List available evals without running them: axiom eval --list

Analyze results in Console

When you run an eval, the Axiom AI SDK captures a detailed OpenTelemetry trace for the entire run. This includes parent spans for the evaluation suite and child spans for each individual test case, task execution, and scorer result. Axiom enriches these traces with eval.* attributes, allowing you to analyze results in depth in the Axiom Console. Each eval run reports:
  • Pass/fail status for each test case
  • Scores from each scorer
  • Comparison to baseline (if available)
  • Links to view detailed traces in Axiom
The Console features leaderboards and comparison views to track score progression across different versions of a capability, helping you verify that your changes are leading to measurable improvements.

Additional configuration options

Custom scorers

A scorer is a function that scores a capability’s output. Scorers receive the input, the generated output, and the expected value, and return a score. The example above uses two custom scorers. Scorers can return metadata alongside the score. You can use the autoevals library instead of custom scorers. autoevals provides prebuilt scorers for common tasks like semantic similarity, factual correctness, and text matching.
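For example, the sketch below wraps the Levenshtein scorer from autoevals so it compares the response field of the ticket classifier’s structured output. It assumes that autoevals scorers accept an { output, expected } pair of strings and return an object with a score between 0 and 1, and that the Scorer helper accepts an async scorer function; adapt the field access to your own output shape.
import { Scorer } from 'axiom/ai/evals';
import { Levenshtein } from 'autoevals';

// Wrap a prebuilt autoevals scorer so it can score the structured output
// of classifyTicket. Assumes Scorer accepts async scorer functions.
const ResponseSimilarityScorer = Scorer(
  'Response-Similarity',
  async ({ output, expected }: {
    output: { response: string };
    expected: { response: string };
  }) => {
    // Levenshtein compares two strings and returns { name, score },
    // where score is a similarity value between 0 and 1.
    const result = await Levenshtein({
      output: output.response,
      expected: expected.response,
    });

    return {
      score: result.score ?? 0,
      metadata: { details: 'Levenshtein similarity on the response text' },
    };
  }
);

You can add ResponseSimilarityScorer to the scorers array of the Eval definition to combine categorical, exact, and fuzzy text checks in a single run.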

Run experiments

Flags let you parameterize your AI behavior (like model choice or prompting strategies) and run experiments with different configurations. They’re type-safe via Zod schemas. The example above uses the ticketClassification.model flag to test different language models; a sketch of a prompt-strategy flag follows the list below. Each flag has a default value that you can override at runtime in one of the following ways:
  • Override flags directly when you run the eval:
    axiom eval --flag.ticketClassification.model=gpt-4o
    
  • Alternatively, specify the flag overrides in a JSON file.
    experiment.json
    {
      "ticketClassification": {
        "model": "gpt-4o"
      }
    }
    
    Then pass the JSON file as the value of the --flags-config parameter when you run the eval:
    axiom eval --flags-config=experiment.json
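
For example, a prompt-strategy experiment could extend the flag schema in /src/lib/app-scope.ts with a tone flag and read it inside the task. The tone flag and its values below are illustrative, not part of the Axiom SDK; the sketch only shows the pattern of defining a flag, reading it with flag(), and overriding it per run.
import { createAppScope } from 'axiom/ai';
import { z } from 'zod';

// Extend the existing schema with a hypothetical prompt-strategy flag.
export const flagSchema = z.object({
  ticketClassification: z.object({
    model: z.string().default('gpt-4o-mini'),
    // Illustrative only: controls which system prompt variant the task uses.
    tone: z.enum(['formal', 'friendly']).default('formal'),
  }),
});

export const { flag, pickFlags } = createAppScope({ flagSchema });

// Inside classifyTicket, read the flag to pick a prompt variant:
//
//   const tone = flag('ticketClassification.tone');
//   const systemPrompt = tone === 'friendly'
//     ? 'You are a friendly, upbeat support engineer classifying tickets ...'
//     : 'You are a customer support engineer classifying tickets ...';

You can then run the eval once per variant, for example with axiom eval --flag.ticketClassification.tone=friendly, and compare the runs in the Console.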
    

What’s next?

A capability is ready to deploy when it meets your quality benchmarks. After deployment, possible next steps include:
  • Baseline comparisons: Run evals multiple times to track regressions over time.
  • Experiment with flags: Test different models or strategies using flag overrides.
  • Advanced scorers: Build custom scorers for domain-specific metrics.
  • CI/CD integration: Add axiom eval to your CI pipeline to catch regressions.
The next step is to monitor your capability’s performance with real-world traffic. To learn more about this step of the AI engineering workflow, see Observe.