# Write evaluations

> Learn how to create offline evaluation functions with collections, tasks, and scorers.

export const definitions = {
  'Capability': 'A generative AI capability is a system that uses large language models to perform a specific task.',
  'Collection': 'A curated set of reference records that are used for the development, testing, and evaluation of a capability.',
  'Console': "Axiom’s intuitive web app built for exploration, visualization, and monitoring of your data.",
  'Eval': 'The process of testing a capability against a collection of ground truth references using one or more scorers.',
  'GroundTruth': 'The validated, expert-approved correct output for a given input.',
  'EventDB': "Axiom’s robust, cost-effective, and scalable datastore specifically optimized for timestamped event data.",
  'OnlineEval': 'The process of applying a scorer to a capability’s live production traffic.',
  'Scorer': 'A function that measures a capability’s output.'
};

An offline evaluation is a test suite for your AI capability. It runs your capability against a <Tooltip tip={definitions.Collection}>collection</Tooltip> of test cases and scores the results using <Tooltip tip={definitions.Scorer}>scorers</Tooltip>. This page explains how to write offline evaluation functions using Axiom's `Eval` API.

<Info>
  This page covers writing offline evaluations. For online evaluations, see [Online evaluations](/ai-engineering/evaluate/online-evaluations/write-run-evaluations).
</Info>

## Prerequisites

* Follow the procedure in [Quickstart](/ai-engineering/quickstart) to set up Axiom AI SDK in your TypeScript project.
* For offline evaluations, use an API token with permissions to ingest **and query** your dataset. Other AI engineering workflows only require a token with ingest permissions.
* Wrap your AI model with `wrapAISDKModel` for automatic tracing. See [Instrumentation with Axiom AI SDK](/ai-engineering/observe/axiom-ai-sdk-instrumentation) for details.

Instead of using the environment variables explained in the [Quickstart](/ai-engineering/quickstart), you can authenticate with OAuth.

<Accordion title="Authenticate with OAuth">
  The Axiom AI SDK includes a CLI for authenticating and running offline evaluations. Authenticate so that evaluation runs are recorded in Axiom and attributed to your user account.

  ### Login

  ```bash theme={null}
  npx axiom auth login
  ```

  This opens your browser and prompts you to authorize the CLI with your Axiom account. Once authorized, the CLI stores your credentials locally.

  ### Check authentication status

  ```bash theme={null}
  npx axiom auth status
  ```

  ### Switch organizations

  If you belong to multiple Axiom organizations:

  ```bash theme={null}
  npx axiom auth switch
  ```

  ### Logout

  ```bash theme={null}
  npx axiom auth logout
  ```
</Accordion>

## Anatomy of an offline evaluation

The `Eval` function defines a complete test suite for your capability. Here’s the basic structure:

```ts theme={null}
import { Eval } from 'axiom/ai/evals';
import { Scorer } from 'axiom/ai/scorers';

Eval('evaluation-name', {
  data: [/* test cases */],
  task: async ({ input }) => {/* run capability */},
  scorers: [/* scoring functions */],
  metadata: {/* optional metadata */},
});
```

### Key parameters

* **`data`**: An array of test cases, or a function that returns an array of test cases. Each test case has an `input` (what you send to your capability) and an `expected` output (the ground truth).
* **`task`**: An async function that executes your capability for a given input and returns the output.
* **`scorers`**: An array of scorer functions that evaluate the output against the expected result.
* **`metadata`**: Optional metadata like a description or tags.
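
To make the relationship between these parameters concrete, the following is a simplified, hypothetical sketch of how a runner could combine them. It isn't Axiom's implementation: `TestCase`, `Task`, `ScoreFn`, and `runSketch` are illustrative names, and the real `Eval` function does more, such as recording evaluation runs in Axiom.

```ts
// Hypothetical shapes for illustration only. These are not the Axiom AI SDK's types.
type TestCase<I, E> = { input: I; expected: E };
type Task<I, O> = (args: { input: I }) => Promise<O>;
type ScoreFn<I, O, E> = (args: { input: I; output: O; expected: E }) => boolean;

// Runs each test case through the task, then applies every scorer to the output.
// Returns one row per test case and one boolean per scorer.
async function runSketch<I, O, E>(
  data: TestCase<I, E>[],
  task: Task<I, O>,
  scorers: ScoreFn<I, O, E>[],
): Promise<boolean[][]> {
  const results: boolean[][] = [];
  for (const { input, expected } of data) {
    const output = await task({ input });
    results.push(scorers.map((score) => score({ input, output, expected })));
  }
  return results;
}
```

The point of the sketch is the data flow: `input` goes to the task, and the task's output is compared against `expected` by each scorer.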

## Create collections

The `data` parameter defines your collection of test cases. Start with a small set of examples and grow it over time as you discover edge cases.

### Inline collections

For small collections, define test cases directly in the offline evaluation:

```ts theme={null}
Eval('classify-sentiment', {
  data: [
    {
      input: { text: 'I love this product!' },
      expected: { sentiment: 'positive' },
    },
    {
      input: { text: 'This is terrible.' },
      expected: { sentiment: 'negative' },
    },
    {
      input: { text: 'It works as expected.' },
      expected: { sentiment: 'neutral' },
    },
  ],
  // ... rest of eval
});
```

### External collections

For larger collections, load test cases from external files or databases:

```ts theme={null}
import { readFile } from 'fs/promises';

Eval('classify-sentiment', {
  data: async () => {
    const content = await readFile('./test-cases/sentiment.json', 'utf-8');
    return JSON.parse(content);
  },
  // ... rest of eval
});
```

<Tip>
  We recommend storing collections in version control alongside your code. This makes it easy to track how your test suite evolves and ensures evaluations are reproducible.
</Tip>
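
Because `JSON.parse` returns untyped data, a malformed file can surface later as a confusing task or scorer error. One option is to check the shape when loading. The following is a minimal sketch, assuming each test case is an object with `input` and `expected` keys as described above. `parseTestCases` is a hypothetical helper, not part of the Axiom AI SDK.

```ts
// Hypothetical helper for illustration. Not part of the Axiom AI SDK.
type TestCase = { input: unknown; expected: unknown };

// Parses a JSON string and verifies it is an array of { input, expected } objects.
function parseTestCases(json: string): TestCase[] {
  const parsed: unknown = JSON.parse(json);
  if (!Array.isArray(parsed)) {
    throw new Error('Expected a JSON array of test cases');
  }
  return parsed.map((item: unknown, index: number): TestCase => {
    if (typeof item !== 'object' || item === null) {
      throw new Error(`Test case at index ${index} is not an object`);
    }
    const record = item as Record<string, unknown>;
    if (!('input' in record) || !('expected' in record)) {
      throw new Error(`Test case at index ${index} needs "input" and "expected"`);
    }
    return { input: record.input, expected: record.expected };
  });
}
```

With a helper like this, the `data` function can return `parseTestCases(content)` in place of calling `JSON.parse` directly, so a broken file fails early with a clear message.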

## Define tasks

The `task` function executes your AI capability for each test case. It receives the `input` from the test case and should return the output your capability produces.

```ts theme={null}
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { wrapAISDKModel } from 'axiom/ai';

async function classifySentiment(text: string) {
  const result = await generateText({
    model: wrapAISDKModel(openai('gpt-4o-mini')),
    prompt: `Classify the sentiment of this text as positive, negative, or neutral: "${text}"`,
  });
  
  return { sentiment: result.text };
}

Eval('classify-sentiment', {
  data: [/* ... */],
  task: async ({ input }) => {
    return await classifySentiment(input.text);
  },
  scorers: [/* ... */],
});
```

<Note>
  The task function should generally be the same code you use in your actual capability. This ensures your evaluations accurately reflect real-world behavior.
</Note>
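
Models that return free-form text can add whitespace, capitalization, or punctuation around a label, which makes strict comparisons in scorers fail for reasons unrelated to quality. One way to handle this is to normalize the text inside your capability, so production code and evaluations share the same behavior. The following is a minimal sketch. `normalizeSentiment` is a hypothetical helper, not part of the Axiom AI SDK, and the `'unknown'` fallback is an assumption for illustration.

```ts
// Hypothetical helper for illustration. Not part of the Axiom AI SDK.
type Sentiment = 'positive' | 'negative' | 'neutral';

const LABELS: readonly Sentiment[] = ['positive', 'negative', 'neutral'];

// Maps raw model text to a known label, or 'unknown' if the text is not exactly one label.
function normalizeSentiment(raw: string): Sentiment | 'unknown' {
  // Lowercase the text and drop everything except letters, e.g. " Positive.\n" becomes "positive".
  const cleaned = raw.trim().toLowerCase().replace(/[^a-z]/g, '');
  return LABELS.find((label) => label === cleaned) ?? 'unknown';
}
```

In the example above, `classifySentiment` could return `{ sentiment: normalizeSentiment(result.text) }` instead of the raw text.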

## Create scorers

Scorers evaluate your capability's output. In offline evaluations, scorers receive `input`, `output`, and `expected` (ground truth), and return a score. For the full `Scorer` API reference including return types, patterns, and third-party integrations, see [Scorers](/ai-engineering/evaluate/scorers).

Here's a quick example of an offline scorer that compares output to expected values:

```ts theme={null}
import { Scorer } from 'axiom/ai/scorers';

const ExactMatchScorer = Scorer(
  'exact-match',
  ({ output, expected }) => {
    return output.sentiment === expected.sentiment;
  }
);
```
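
Some inputs have more than one acceptable answer. The following is a minimal sketch of comparison logic for that case, written as a plain function so you can unit test it without the SDK. The `acceptable` field and the `matchesAnyAcceptable` name are assumptions for illustration and aren't part of the Axiom AI SDK.

```ts
// Hypothetical shapes for illustration. The `acceptable` field is an assumption.
type SentimentOutput = { sentiment: string };
type SentimentExpected = { acceptable: string[] };

// Returns true when the output label is one of the acceptable labels for this test case.
function matchesAnyAcceptable({
  output,
  expected,
}: {
  output: SentimentOutput;
  expected: SentimentExpected;
}): boolean {
  return expected.acceptable.some((label) => label === output.sentiment);
}
```

Because it takes `output` and `expected` and returns a boolean, a function like this can be passed as the second argument to `Scorer` in place of an inline callback.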

## Complete example

Here's a complete evaluation for a support ticket classification system:

```ts src/lib/capabilities/classify-ticket/evaluations/spam-classification.eval.ts expandable theme={null}
import { Eval } from 'axiom/ai/evals';
import { Scorer } from 'axiom/ai/scorers';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { wrapAISDKModel } from 'axiom/ai';
import { z } from 'zod';

// The capability function
async function classifyTicket({ 
  subject, 
  content 
}: { 
  subject?: string; 
  content: string 
}) {
  const result = await generateObject({
    model: wrapAISDKModel(openai('gpt-4o-mini')),
    messages: [
      {
        role: 'system',
        content: `You are a customer support engineer. Classify tickets as: 
        spam, question, feature_request, or bug_report.`,
      },
      {
        role: 'user',
        content: subject ? `Subject: ${subject}\n\n${content}` : content,
      },
    ],
    schema: z.object({
      category: z.enum(['spam', 'question', 'feature_request', 'bug_report']),
      confidence: z.number().min(0).max(1),
    }),
  });

  return result.object;
}

// Custom scorer for category matching
const CategoryScorer = Scorer(
  'category-match',
  ({ output, expected }) => {
    return output.category === expected.category;
  }
);

// Custom scorer for high-confidence predictions
const ConfidenceScorer = Scorer(
  'high-confidence',
  ({ output }) => {
    return output.confidence >= 0.8;
  }
);

// Define the evaluation
Eval('spam-classification', {
  data: [
    {
      input: {
        subject: "Congratulations! You've Won!",
        content: 'Claim your $500 gift card now!',
      },
      expected: {
        category: 'spam',
      },
    },
    {
      input: {
        subject: 'How do I reset my password?',
        content: 'I forgot my password and need help resetting it.',
      },
      expected: {
        category: 'question',
      },
    },
    {
      input: {
        subject: 'Feature request: Dark mode',
        content: 'Would love to see a dark mode option in the app.',
      },
      expected: {
        category: 'feature_request',
      },
    },
    {
      input: {
        subject: 'App crashes on startup',
        content: 'The app crashes immediately when I try to open it.',
      },
      expected: {
        category: 'bug_report',
      },
    },
  ],
  
  task: async ({ input }) => {
    return await classifyTicket(input);
  },
  
  scorers: [CategoryScorer, ConfidenceScorer],
  
  metadata: {
    description: 'Classify support tickets into categories',
  },
});
```

## File naming conventions

Name your evaluation files with the `.eval.ts` extension so they're automatically discovered by the Axiom CLI:

```
src/
└── lib/
    └── capabilities/
        └── classify-ticket/
            └── evaluations/
                ├── spam-classification.eval.ts
                ├── category-accuracy.eval.ts
                └── edge-cases.eval.ts
```

The CLI will find all files matching `**/*.eval.{ts,js,mts,mjs,cts,cjs}` based on your `axiom.config.ts` configuration.
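
To check whether a file name fits this pattern before running the CLI, you can approximate the glob with a regular expression. The following is a rough sketch. `isEvalFile` is a hypothetical helper, and the CLI's actual matching depends on your `axiom.config.ts`, so treat it as an approximation.

```ts
// Hypothetical helper for illustration. The Axiom CLI's own matching may differ.
const EVAL_FILE_PATTERN = /\.eval\.(?:ts|js|mts|mjs|cts|cjs)$/;

// Returns true when the path ends in .eval. followed by one of the listed extensions.
function isEvalFile(path: string): boolean {
  return EVAL_FILE_PATTERN.test(path);
}
```

For example, `spam-classification.eval.ts` fits the pattern, while `spam-classification.test.ts` and `spam-classification.eval.tsx` don't.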

## What's next?

* To parameterize your capabilities and run experiments, see [Flags and experiments](/ai-engineering/evaluate/flags-experiments).
* To run offline evaluations using the CLI, see [Run offline evaluations](/ai-engineering/evaluate/run-evaluations).
