Scorers are functions that measure your AI capability’s output. They receive the inputs and outputs of a capability run, and return a score. The same Scorer API works in both offline and online evaluations.
The key difference between the two contexts is what the scorer receives:
- Offline scorers receive `input`, `output`, and `expected` (ground truth from your test collection).
- Online scorers are reference-free: they receive `input` and `output` without an `expected` value.
Because the API is the same, you can reuse scorers across both contexts. A scorer you write for offline evaluations works in online evaluations as long as it doesn’t depend on expected.
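For example, scoring logic that only reads `output` runs unchanged in both contexts. The sketch below uses plain functions rather than the SDK wrapper; `ScorerArgs` and `isNonEmpty` are illustrative names, not part of the SDK.

```typescript
// Reference-free scoring logic: it only reads `output`, so the same
// function works offline (where `expected` is also available) and
// online (where it isn't). Illustrative sketch, not an SDK API.
type ScorerArgs = { input: string; output: string; expected?: string };

const isNonEmpty = ({ output }: ScorerArgs): boolean =>
  output.trim().length > 0;

// Offline run: the extra `expected` field is simply ignored.
isNonEmpty({ input: 'q', output: 'a', expected: 'a' }); // true

// Online run: no `expected` available.
isNonEmpty({ input: 'q', output: '  ' }); // false
```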
## Create scorers

Create scorers using the `Scorer` wrapper. A scorer takes a name and a scoring function:
```typescript
import { Scorer } from 'axiom/ai/scorers';

const MyScorer = Scorer(
  'my-scorer',
  ({ input, output }) => {
    // Return a boolean, a number (0-1), or { score, metadata }
  },
);
```
## Return types
Scorers can return three types of values:
### Boolean
Return `true` or `false` for simple pass/fail checks. The SDK converts booleans to `1` (pass) or `0` (fail) and marks the score as boolean in telemetry.
```typescript
const isKnownCategory = Scorer(
  'is-known-category',
  ({ output }: { output: string }) => {
    return ['support', 'complaint', 'spam', 'unknown'].includes(output);
  },
);
```
### Numeric
Return a number between 0 and 1 for graded scoring:
```typescript
const formatConfidence = Scorer(
  'format-confidence',
  ({ output }: { output: string }) => {
    const trimmed = output.trim().toLowerCase();
    const isSingleWord = !trimmed.includes(' ');
    const isClean = /^[a-z_]+$/.test(trimmed);
    return (isSingleWord ? 0.5 : 0) + (isClean ? 0.5 : 0);
  },
);
```
### Object

Return an object with `score` and `metadata` to attach additional context to the eval span:
```typescript
const validCategory = Scorer(
  'valid-category',
  ({ output }: { output: string }) => {
    const validCategories = ['support', 'complaint', 'spam', 'unknown'];
    return {
      score: validCategories.includes(output),
      metadata: {
        category: output,
        validCategories,
      },
    };
  },
);
```
## Scorer patterns
### Exact match (offline)
Compare the output directly against the expected value. This pattern only works in offline evaluations where ground truth is available.
```typescript
const ExactMatchScorer = Scorer(
  'exact-match',
  ({ output, expected }) => {
    return output.sentiment === expected.sentiment;
  },
);
```
### Heuristic checks
Validate output structure or format without ground truth. These scorers work in both offline and online evaluations.
```typescript
const formatScorer = Scorer('format', ({ output }: { output: string }) => {
  const trimmed = output.trim();
  return /[.!?]$/.test(trimmed) && !trimmed.includes('\n') && trimmed.length <= 200;
});
```
### LLM-as-judge
Use a second model to evaluate the output. Async scorers are useful in both contexts, especially in online evaluations where you don’t have ground truth and need semantic quality assessment.
```typescript
import { generateObject } from 'ai';
import { z } from 'zod';

const relevanceScorer = Scorer(
  'relevance',
  async ({ input, output }: { input: string; output: string }) => {
    const result = await generateObject({
      model: judgeModel, // any AI SDK model configured as the judge
      schema: z.object({
        relevant: z.boolean().describe('Whether the response answers the question'),
      }),
      system: 'You evaluate if an AI response answers the user question.',
      prompt: `Question: ${input}\n\nResponse: ${output}`,
    });
    return result.object.relevant;
  },
);
```
LLM judge scorers add latency and cost per evaluation. In online evaluations, use sampling to control how often they run.
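One way to sample is to wrap the scoring function so it only runs on a fraction of requests. The `sampled` helper and `sampleRate` parameter below are illustrative names, not SDK APIs; check your evaluation framework for built-in sampling support before rolling your own.

```typescript
// Hypothetical sampling wrapper: runs an expensive async scorer on
// roughly `sampleRate` of online runs and skips the rest.
function sampled<T>(
  sampleRate: number,
  score: (args: T) => Promise<number | boolean>,
) {
  return async (args: T): Promise<number | boolean | null> => {
    // Math.random() is in [0, 1), so rate 1 always runs, rate 0 never does.
    if (Math.random() >= sampleRate) return null; // skipped this run
    return score(args);
  };
}

// Judge only ~10% of online traffic.
const sampledRelevance = sampled(
  0.1,
  async ({ input, output }: { input: string; output: string }) => {
    // ...call the LLM judge here...
    return true;
  },
);
```

A `null` return signals "not scored", which keeps skipped runs distinct from runs that scored `0`.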
## Use autoevals

The `autoevals` library provides prebuilt scorers for common tasks:
```typescript
import { Scorer } from 'axiom/ai/scorers';
import { Levenshtein, FactualityScorer } from 'autoevals';

const LevenshteinScorer = Scorer(
  'levenshtein',
  ({ output, expected }) => {
    return Levenshtein({ output: output.text, expected: expected.text });
  },
);

const FactualityCheck = Scorer(
  'factuality',
  async ({ output, expected }) => {
    return await FactualityScorer({
      output: output.text,
      expected: expected.text,
    });
  },
);
```
Use multiple scorers to evaluate different aspects of your capability. For example, check both exact accuracy and semantic similarity to get a complete picture of performance.
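For instance, you might pair a strict membership check with a graded score and read the results side by side. The `runScorers` helper below is an illustrative sketch, not an SDK API; it mirrors the SDK convention of converting booleans to 1 or 0.

```typescript
// Illustrative helper: run several scoring functions over one output
// and collect named numeric results. Not an SDK API.
type ScoreResult = { name: string; score: number };

function runScorers(
  output: string,
  scorers: Array<{ name: string; fn: (output: string) => number | boolean }>,
): ScoreResult[] {
  return scorers.map(({ name, fn }) => {
    const raw = fn(output);
    // Booleans become 1 (pass) or 0 (fail), matching the SDK convention.
    return { name, score: typeof raw === 'boolean' ? (raw ? 1 : 0) : raw };
  });
}

const results = runScorers('support', [
  { name: 'is-known-category', fn: (o) => ['support', 'spam'].includes(o) },
  { name: 'length', fn: (o) => Math.min(o.length / 10, 1) },
]);
// results[0].score === 1; results[1].score === 0.7
```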
## Telemetry
Each scorer produces an OTel span with the following attributes:
| Attribute | Description |
|---|---|
| `gen_ai.operation.name` | Always `eval.score` |
| `eval.name` | The eval name |
| `eval.score.name` | The scorer name |
| `eval.score.value` | The numeric score (0-1) |
| `eval.score.metadata` | JSON string of scorer metadata. Includes `eval.score.is_boolean: true` when the scorer returned a boolean. |
| `eval.capability.name` | The capability being evaluated |
| `eval.step.name` | The step within the capability (when set) |
| `eval.tags` | `["online"]` for online evaluations |
## What’s next?