Catch what tests miss: Online evaluations for AI capabilities

Mano Toth, Senior Technical Writer
February 27, 2026
  • Continuous quality monitoring: Score production outputs in real time without ground truth, using the same Scorer API as offline evaluations
  • Fire-and-forget execution: onlineEval runs scorers in the background without blocking your response to the user
  • Per-scorer sampling: Run cheap heuristic scorers on every request and expensive LLM judges on a fraction of traffic
  • Traced to the source: Each eval span links to the originating generation span via OTel span links, so you can see exactly what produced a low score
  • Closes the loop: Production failures become offline test cases. Online and offline evaluations reinforce each other

Offline evaluations catch regressions before deployment. User feedback gives you direct signal when something breaks in production. But between the test cases you curated and the feedback users volunteer, there's a blind spot: the long tail of production traffic you never anticipated, the format drift you didn't check for, the subtle quality shifts after a model provider update.

Online evaluations fill this gap. Attach scoring functions to live production traffic and get continuous visibility into how your AI capability is performing. No ground truth required. No blocking your response. Automated quality scores stream alongside every trace.

Offline evaluations tell you if you're ready to ship. User feedback tells you what users think. Online evaluations tell you how your capability is actually performing, right now, on requests you never thought to test.

How it works

Import onlineEval and Scorer from the Axiom AI SDK. Call onlineEval inside your withSpan callback with the output you just generated. The call doesn't block your response to the user — scorers execute in the background and results stream to Axiom as OTel spans linked to the originating generation span.

import { withSpan } from 'axiom/ai';
import { onlineEval } from 'axiom/ai/evals/online';
import { Scorer } from 'axiom/ai/evals/scorers';
import { generateText } from 'ai';
import { gpt4oMini } from './lib/model';

const categories = ['support', 'complaint', 'wrong_company', 'spam', 'unknown'];

const validCategoryScorer = Scorer('valid-category', ({ output }: { output: string }) => {
  return categories.includes(output.trim().toLowerCase());
});

const result = await withSpan({ capability: 'support-agent', step: 'categorize' }, async () => {
  const response = await generateText({
    model: gpt4oMini,
    messages: [{ role: 'user', content: `Classify this message as: ${categories.join(', ')}.\n\n${userMessage}` }],
  });

  void onlineEval('categorize-quality', {
    capability: 'support-agent',
    step: 'categorize',
    input: userMessage,
    output: response.text,
    scorers: [validCategoryScorer],
  });

  return response.text;
});

The void operator is intentional. onlineEval returns a Promise, but you deliberately don't await it. Your response reaches the user without waiting for scorers to finish. In a long-running server, this fire-and-forget pattern means evaluation never adds latency to your request. If a scorer fails, the failure is recorded as an OTel event on the scorer span, not as an exception in your request handler.
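The pattern itself is simple enough to sketch in a few lines. This is not SDK code, just an illustration of the mechanics: the task starts immediately, the caller never waits, and any rejection is captured out of band instead of surfacing in the request handler.

```typescript
// Captured failures stand in for the OTel events the SDK records.
const failures: string[] = [];

function fireAndForget(task: () => Promise<void>): void {
  // `void` marks the promise as intentionally un-awaited; the .catch
  // keeps rejections from becoming unhandled errors in the process.
  void task().catch((err) => failures.push(String(err)));
}
```

In the SDK, the equivalent of `failures.push` is recording an event on the scorer span, so failed scorer runs are visible in the trace rather than in your logs.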

Online evaluations use the same Scorer API as offline evaluations. The key difference is context: online scorers are reference-free. They receive input and output, but no expected value. Return a boolean for pass/fail, a number for graded scoring, or an object with score and metadata for richer telemetry. Scorers that don't depend on ground truth work in both offline and online contexts without modification.
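The three return shapes can be sketched as plain callbacks matching the `({ input, output })` contract described above. The function names here are illustrative, not part of the SDK:

```typescript
type ScorerArgs = { input: string; output: string };

// Boolean pass/fail: normalized to 1/0.
const isNonEmpty = ({ output }: ScorerArgs) => output.trim().length > 0;

// Graded numeric score in [0, 1].
const lengthScore = ({ output }: ScorerArgs) =>
  Math.min(output.length / 200, 1);

// Object with score plus metadata for richer telemetry.
const endsWithPeriod = ({ output }: ScorerArgs) => ({
  score: output.trim().endsWith('.'),
  metadata: { lastChar: output.trim().slice(-1) },
});
```

Because none of these callbacks reads an expected value, each would work unchanged as an offline scorer too.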

Sampling: control cost without losing signal

LLM-as-judge scorers are powerful but can be expensive at scale. Per-scorer sampling lets you control the proportion of production traffic each scorer evaluates, so you can run cheap heuristic scorers on everything and reserve LLM judges for a fraction of requests.

void onlineEval('categorize-message', {
  capability: 'support-agent',
  step: 'categorize-message',
  input: userMessage,
  output: result,
  scorers: [
    { scorer: relevanceScorer, sampling: 0.1 },
    formatConfidenceScorer,
  ],
});

Set sampling: 0.1 to evaluate roughly 10% of traffic. Set sampling: 0.5 for 50%. Omit it entirely for 100%. You can also pass an async function for conditional sampling: sample more heavily on long inputs where your capability struggles, or on specific user segments where quality matters most.
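A conditional sampling policy might look like the sketch below: sample long inputs, where the capability tends to struggle, more heavily than short ones. The exact signature the SDK expects for an async sampling function is an assumption here; check the SDK reference for the real contract.

```typescript
// Pure rate function, kept separate so it is easy to test and tune.
const samplingRate = (input: string): number =>
  input.length > 500 ? 0.5 : 0.05;

// Async sampling decision: evaluate 50% of long inputs, 5% of short ones.
const sampleLongInputs = async ({ input }: { input: string }) =>
  Math.random() < samplingRate(input);
```

You would then pass it per scorer, e.g. `{ scorer: relevanceScorer, sampling: sampleLongInputs }`, keeping the cheap structural scorers at 100%.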

This means you can build a layered monitoring strategy. Structural checks on every request catch format regressions instantly. Semantic judges on a sample provide deeper quality signal at manageable cost. Adjust the ratios as you learn where your capability is strong and where it's fragile.

Linked to the trace

Each onlineEval call creates a parent eval span with one child span per scorer. All spans link back to the originating generation span via OpenTelemetry span links. When a score drops, you click through to the full trace: the exact input, the model call, any tool executions, the output that scored poorly. You're not guessing what went wrong. You're looking at it.

When you call onlineEval inside withSpan, the active span is auto-detected and linked. For deferred evaluation, where you want to run scorers after withSpan returns, capture span.spanContext() and pass it as links:

import type { SpanContext } from '@opentelemetry/api';

let originCtx!: SpanContext;
const result = await withSpan(
  { capability: 'demo', step: 'answer' },
  async (span) => {
    originCtx = span.spanContext();
    return await generateText({ model, messages });
  },
);

void onlineEval('answer-relevance', {
  capability: 'demo',
  step: 'answer',
  links: originCtx,
  input: question,
  output: result.text,
  scorers: [{ scorer: relevanceScorer, sampling: 0.5 }],
});

The link is the stable relationship. Whether the eval span is a child of the originating span (called inside withSpan) or a sibling with a link (called outside), the connection is preserved in Axiom's trace view.

A real-world example

Here's what online evaluations look like in a production support agent. The capability classifies incoming messages into categories. Two scorers monitor quality: one validates that the output is a known category, the other checks format confidence.

import { withSpan, wrapAISDKModel } from 'axiom/ai';
import { onlineEval } from 'axiom/ai/evals/online';
import { Scorer } from 'axiom/ai/evals/scorers';
import { generateText } from 'ai';

// `model` is an AI SDK model wrapped with wrapAISDKModel();
// classificationPrompt() and parseCategory() are app-specific helpers.
const messageCategories = ['support', 'complaint', 'wrong_company', 'spam', 'unknown'] as const;

const validCategoryScorer = Scorer(
  'valid-category',
  ({ output }: { output: string }) => {
    const isValid = (messageCategories as readonly string[]).includes(output);
    return {
      score: isValid,
      metadata: { category: output, validCategories: messageCategories },
    };
  },
);

const formatConfidenceScorer = Scorer(
  'format-confidence',
  ({ output }: { output: string }) => {
    const trimmed = output.trim().toLowerCase();
    const isSingleWord = !trimmed.includes(' ');
    const isLowercase = output === output.toLowerCase();
    const isClean = /^[a-z_]+$/.test(trimmed);

    return {
      score: (isSingleWord ? 0.4 : 0) + (isLowercase ? 0.3 : 0) + (isClean ? 0.3 : 0),
      metadata: { isSingleWord, isLowercase, isClean },
    };
  },
);

export async function categorizeMessage(userMessage: string) {
  return await withSpan(
    { capability: 'support-agent', step: 'categorize-message' },
    async () => {
      const response = await generateText({
        model,
        messages: [{ role: 'system', content: classificationPrompt(userMessage) }],
      });

      const result = parseCategory(response.text);

      void onlineEval('categorize-message', {
        capability: 'support-agent',
        step: 'categorize-message',
        input: userMessage,
        output: result,
        scorers: [
          { scorer: validCategoryScorer, sampling: 0.1 },
          formatConfidenceScorer,
        ],
      });

      return result;
    },
  );
}

The validCategoryScorer returns a boolean with metadata, automatically normalized to 1/0. The formatConfidenceScorer returns a weighted numeric score. Both attach metadata that appears on the OTel span, so when you investigate a low score, you see exactly which checks failed and why.

In production, this runs on every request without any impact on response latency. The format confidence scorer evaluates every request; the category validation samples 10% of traffic.

Analyze and iterate

Online evaluation results appear in the Axiom Console alongside your production traces. Filter by eval.tags: ["online"] to focus on online eval spans. Track score trends over time. Compare distributions before and after deployments. When a scorer's average drops, drill into the failing spans to see the input, output, and full trace that produced them.

For example, the query below shows a per-scorer, per-capability breakdown of average score, pass rate, and failure count over time.

['genai-traces']
| where ['attributes.gen_ai.operation.name'] == "eval.score"
| extend
    scorer = ['attributes.eval.score.name'],
    score = todouble(['attributes.eval.score.value']),
    capability = ['attributes.eval.capability.name'],
    step = ['attributes.eval.step.name']
| summarize
    avg_score = round(avg(score), 2),
    pass_rate = round(countif(score >= 0.5) * 100.0 / count(), 1),
    fail_count = countif(score < 0.5),
    total = count()
    by bin_auto(_time), scorer, capability
| order by _time desc

Analyze online evaluation results in Axiom Console.

The real value comes from what you do with the data. When you find a production trace where your capability scored poorly, you have a concrete failure to investigate. Document what should have happened, add the failing input and expected output to your offline test collection, fix the issue, and verify with an offline run before redeploying.

This is where online and offline evaluations reinforce each other. Online evals catch problems you didn't anticipate. Offline evals let you systematically fix and prevent them from recurring. Your test coverage grows from real production failures rather than hypothetical scenarios you imagined during development.
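Promoting a failure into a test case can be as simple as capturing the input and the documented expected output. The collection shape and `exactMatch` helper below are illustrative, not the SDK's offline eval API; the point is that offline scorers, unlike online ones, get to be reference-aware:

```typescript
type TestCase = { input: string; expected: string };

const offlineCollection: TestCase[] = [
  // Captured from a low-scoring production trace:
  { input: 'My package never arrived and nobody answers!', expected: 'complaint' },
];

// Offline scorers receive the expected value and can compare directly.
const exactMatch = ({ output, expected }: { output: string; expected: string }) =>
  output.trim().toLowerCase() === expected;
```

Each redeploy then runs the grown collection, so the same failure can never silently return.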

The bigger picture

We started with observability for AI engineering: rich telemetry for prompts, completions, tool calls, and costs. Then offline evaluations: systematic testing against curated collections before deployment. Then user feedback: direct signal from end users when something breaks. Now online evaluations: continuous scoring of production traffic in real time.

Each piece addresses a different blind spot. Offline evals answer "is this ready to ship?" User feedback answers "what do users think?" Online evals answer "how is this actually performing, right now, on traffic I never thought to test?"

Together, they close the loop. Production insights strengthen your test coverage. Your evaluation results inform what to ship. The improvement cycle gets shorter and more evidence-driven with each iteration.

Get started

Online evaluations are available in the Axiom AI SDK and Console now.

The teams shipping the most reliable AI capabilities aren't just testing before deployment or waiting for user feedback. They're scoring production traffic continuously. That practice is now native to Axiom.

Interested to learn more about Axiom?

Sign up for free or contact us at sales@axiom.co to talk with one of the team about our enterprise plans.
