- Continuous quality monitoring: Score production outputs in real time without ground truth, using the same Scorer API as offline evaluations
- Fire-and-forget execution: onlineEval runs scorers in the background without blocking your response to the user
- Per-scorer sampling: Run cheap heuristic scorers on every request and expensive LLM judges on a fraction of traffic
- Traced to the source: Each eval span links to the originating generation span via OTel span links, so you can see exactly what produced a low score
- Closes the loop: Production failures become offline test cases. Online and offline evaluations reinforce each other
Offline evaluations catch regressions before deployment. User feedback gives you direct signal when something breaks in production. But between the test cases you curated and the feedback users volunteer, there's a blind spot: the long tail of production traffic you never anticipated, the format drift you didn't check for, the subtle quality shifts after a model provider update.
Online evaluations fill this gap. Attach scoring functions to live production traffic and get continuous visibility into how your AI capability is performing. No ground truth required. No blocking your response. Automated quality scores stream alongside every trace.
Offline evaluations tell you if you're ready to ship. User feedback tells you what users think. Online evaluations tell you how your capability is actually performing, right now, on requests you never thought to test.
How it works
Import onlineEval and Scorer from the Axiom AI SDK. Call onlineEval inside your withSpan callback with the output you just generated. The call doesn't block your response to the user — scorers execute in the background and results stream to Axiom as OTel spans linked to the originating generation span.
```typescript
import { withSpan } from 'axiom/ai';
import { onlineEval } from 'axiom/ai/evals/online';
import { Scorer } from 'axiom/ai/evals/scorers';
import { generateText } from 'ai';
import { gpt4oMini } from './lib/model';

const categories = ['support', 'complaint', 'wrong_company', 'spam', 'unknown'];

const validCategoryScorer = Scorer('valid-category', ({ output }: { output: string }) => {
  return categories.includes(output.trim().toLowerCase());
});

const result = await withSpan({ capability: 'support-agent', step: 'categorize' }, async () => {
  const response = await generateText({
    model: gpt4oMini,
    messages: [{ role: 'user', content: `Classify this message as: ${categories.join(', ')}.\n\n${userMessage}` }],
  });

  void onlineEval('categorize-quality', {
    capability: 'support-agent',
    step: 'categorize',
    input: userMessage,
    output: response.text,
    scorers: [validCategoryScorer],
  });

  return response.text;
});
```

The void keyword is intentional. onlineEval returns a Promise, but you deliberately don't await it. Your response reaches the user without waiting for scorers to finish. In a long-running server, this fire-and-forget pattern means evaluation never adds latency to your request. If a scorer fails, the failure is recorded as an OTel event on the scorer span, not as an exception in your request handler.
Online evaluations use the same Scorer API as offline evaluations. The key difference is context: online scorers are reference-free. They receive input and output, but no expected value. Return a boolean for pass/fail, a number for graded scoring, or an object with score and metadata for richer telemetry. Scorers that don't depend on ground truth work in both offline and online contexts without modification.
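The three return shapes look like this in practice. These are plain callback functions for illustration; in the SDK each would be wrapped with Scorer('name', fn) as shown above:

```typescript
type ScorerArgs = { input?: string; output: string };

// 1. Boolean pass/fail: normalized to 1/0 on the span.
const isNonEmpty = ({ output }: ScorerArgs): boolean => output.trim().length > 0;

// 2. Numeric grade in [0, 1]: here, shorter outputs score higher.
const brevity = ({ output }: ScorerArgs): number =>
  Math.max(0, 1 - output.length / 1000);

// 3. Object with score plus metadata for richer telemetry.
const caseCheck = ({ output }: ScorerArgs) => {
  const lower = output === output.toLowerCase();
  return { score: lower ? 1 : 0, metadata: { lower } };
};
```

Because none of these depend on an expected value, the same functions can serve as reference-free online scorers or as offline scorers against a curated collection.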
Sampling: control cost without losing signal
LLM-as-judge scorers are powerful but can be expensive at scale. Per-scorer sampling lets you control the proportion of production traffic each scorer evaluates, so you can run cheap heuristic scorers on everything and reserve LLM judges for a fraction of requests.
```typescript
void onlineEval('categorize-message', {
  capability: 'support-agent',
  step: 'categorize-message',
  input: userMessage,
  output: result,
  scorers: [
    { scorer: relevanceScorer, sampling: 0.1 },
    formatConfidenceScorer,
  ],
});
```

Set sampling: 0.1 to evaluate roughly 10% of traffic. Set sampling: 0.5 for 50%. Omit it entirely for 100%. You can also pass an async function for conditional sampling: sample more heavily on long inputs where your capability struggles, or on specific user segments where quality matters most.
This means you can build a layered monitoring strategy. Structural checks on every request catch format regressions instantly. Semantic judges on a sample provide deeper quality signal at manageable cost. Adjust the ratios as you learn where your capability is strong and where it's fragile.
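A conditional sampler might look like the sketch below. The argument shape ({ input }) is an assumption for illustration, not the SDK's documented signature:

```typescript
// Hypothetical conditional sampler: evaluate every long input, where the
// capability tends to struggle, but only 10% of short ones.
const sampleLongInputs = async ({ input }: { input: string }): Promise<number> => {
  return input.length > 500 ? 1.0 : 0.1;
};
```

The same pattern extends to user segments: look up the user's tier inside the function and return a higher rate where quality matters most.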
Linked to the trace
Each onlineEval call creates a parent eval span with one child span per scorer. All spans link back to the originating generation span via OpenTelemetry span links. When a score drops, you click through to the full trace: the exact input, the model call, any tool executions, the output that scored poorly. You're not guessing what went wrong. You're looking at it.
When you call onlineEval inside withSpan, the active span is auto-detected and linked. For deferred evaluation, where you want to run scorers after withSpan returns, capture span.spanContext() and pass it as links:
```typescript
import type { SpanContext } from '@opentelemetry/api';

let originCtx!: SpanContext;

const result = await withSpan(
  { capability: 'demo', step: 'answer' },
  async (span) => {
    originCtx = span.spanContext();
    return await generateText({ model, messages });
  },
);

void onlineEval('answer-relevance', {
  capability: 'demo',
  step: 'answer',
  links: originCtx,
  input: question,
  output: result,
  scorers: [{ scorer: relevanceScorer, sampling: 0.5 }],
});
```

The link is the stable relationship. Whether the eval span is a child of the originating span (called inside withSpan) or a sibling with a link (called outside), the connection is preserved in Axiom's trace view.
A real-world example
Here's what online evaluations look like in a production support agent. The capability classifies incoming messages into categories. Two scorers monitor quality: one validates that the output is a known category, the other checks format confidence.
```typescript
import { withSpan, wrapAISDKModel } from 'axiom/ai';
import { onlineEval } from 'axiom/ai/evals/online';
import { Scorer } from 'axiom/ai/evals/scorers';
import { generateText } from 'ai';

const messageCategories = ['support', 'complaint', 'wrong_company', 'spam', 'unknown'] as const;

const validCategoryScorer = Scorer(
  'valid-category',
  ({ output }: { output: string }) => {
    const isValid = (messageCategories as readonly string[]).includes(output);
    return {
      score: isValid,
      metadata: { category: output, validCategories: messageCategories },
    };
  },
);

const formatConfidenceScorer = Scorer(
  'format-confidence',
  ({ output }: { output: string }) => {
    const trimmed = output.trim().toLowerCase();
    const isSingleWord = !trimmed.includes(' ');
    const isLowercase = output === output.toLowerCase();
    const isClean = /^[a-z_]+$/.test(trimmed);
    return {
      score: (isSingleWord ? 0.4 : 0) + (isLowercase ? 0.3 : 0) + (isClean ? 0.3 : 0),
      metadata: { isSingleWord, isLowercase, isClean },
    };
  },
);

export async function categorizeMessage(userMessage: string) {
  return await withSpan(
    { capability: 'support-agent', step: 'categorize-message' },
    async () => {
      const response = await generateText({
        model,
        messages: [{ role: 'system', content: classificationPrompt(userMessage) }],
      });
      const result = parseCategory(response.text);

      void onlineEval('categorize-message', {
        capability: 'support-agent',
        step: 'categorize-message',
        input: userMessage,
        output: result,
        scorers: [
          { scorer: validCategoryScorer, sampling: 0.1 },
          formatConfidenceScorer,
        ],
      });

      return result;
    },
  );
}
```

The validCategoryScorer returns a boolean with metadata, automatically normalized to 1/0. The formatConfidenceScorer returns a weighted numeric score. Both attach metadata that appears on the OTel span, so when you investigate a low score, you see exactly which checks failed and why.
In production, this runs on every request without any impact on response latency. The format confidence scorer evaluates every request; the category validation samples 10% of traffic.
Analyze and iterate
Online evaluation results appear in the Axiom Console alongside your production traces. Filter by eval.tags: ["online"] to focus on online eval spans. Track score trends over time. Compare distributions before and after deployments. When a scorer's average drops, drill into the failing spans to see the input, output, and full trace that produced them.
For example, the query below shows a per-scorer, per-capability breakdown of average score, pass rate, and failure count over time.
```kusto
['genai-traces']
| where ['attributes.gen_ai.operation.name'] == "eval.score"
| extend
    scorer = ['attributes.eval.score.name'],
    score = todouble(['attributes.eval.score.value']),
    capability = ['attributes.eval.capability.name'],
    step = ['attributes.eval.step.name']
| summarize
    avg_score = round(avg(score), 2),
    pass_rate = round(countif(score >= 0.5) * 100.0 / count(), 1),
    fail_count = countif(score < 0.5),
    total = count()
  by bin_auto(_time), scorer, capability
| order by _time desc
```

The real value comes from what you do with the data. When you find a production trace where your capability scored poorly, you have a concrete failure to investigate. Document what should have happened, add the failing input and expected output to your offline test collection, fix the issue, and verify with an offline run before redeploying.
This is where online and offline evaluations reinforce each other. Online evals catch problems you didn't anticipate. Offline evals let you systematically fix and prevent them from recurring. Your test coverage grows from real production failures rather than hypothetical scenarios you imagined during development.
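That promotion step can be sketched as plain data. The FailureCase shape and promoteFailure helper below are illustrative only, not part of the Axiom SDK; your offline collection format may differ:

```typescript
// A failing production case, captured from the Axiom Console trace view.
interface FailureCase {
  input: string;    // the production input that scored poorly
  expected: string; // what the capability should have produced
  source: string;   // provenance, so the test traces back to production
}

const offlineCollection: FailureCase[] = [];

// Promote a production failure into the offline test collection.
function promoteFailure(input: string, expected: string, traceId: string): FailureCase {
  const testCase: FailureCase = { input, expected, source: `trace:${traceId}` };
  offlineCollection.push(testCase);
  return testCase;
}
```

Keeping the trace ID in the test case means that months later you can still answer "why is this case in the suite?" with a link to the original production failure.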
The bigger picture
We started with observability for AI engineering: rich telemetry for prompts, completions, tool calls, and costs. Then offline evaluations: systematic testing against curated collections before deployment. Then user feedback: direct signal from end users when something breaks. Now online evaluations: continuous scoring of production traffic in real time.
Each piece addresses a different blind spot. Offline evals answer "is this ready to ship?" User feedback answers "what do users think?" Online evals answer "how is this actually performing, right now, on traffic I never thought to test?"
Together, they close the loop. Production insights strengthen your test coverage. Your evaluation results inform what to ship. The improvement cycle gets shorter and more evidence-driven with each iteration.
Get started
Online evaluations are available in the Axiom AI SDK and Console now.
- Read the documentation for setup, scorer patterns, and sampling strategies
- Analyze results in the Console
- Scorers work in both offline and online contexts without modification
The teams shipping the most reliable AI capabilities aren't just testing before deployment or waiting for user feedback. They're scoring production traffic continuously. That practice is now native to Axiom.