Score AI outputs on live production traffic using lightweight scorers and sampling.
Online evaluations let you score your AI capability’s outputs on live production traffic. Unlike offline evaluations, which run against a fixed collection of test cases with expected values, online evaluations are reference-free. Use them to monitor quality in production: catch format regressions, run heuristic checks, or sample traffic for LLM-as-judge scoring without affecting your capability’s response.
Online evaluations never throw errors into your application code. Scorer failures are recorded on the eval span as OTel events, so a broken scorer won’t affect your capability’s response.
Online evaluations use the same Scorer API as offline evaluations. The key difference is that online scorers are reference-free: they receive input and output but no expected value. For the full Scorer API reference, including return types, patterns, and LLM-as-judge examples, see Scorers. Here’s a quick example of an online scorer that validates output format:
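A minimal sketch of such a scorer, assuming a scorer is an object with a name and a score function that receives `{ input, output }` and returns a numeric score (the exact Scorer API shape may differ; see the Scorers reference):

```typescript
// Assumed scorer shape for illustration; consult the Scorer API reference
// for the exact types your SDK version expects.
type ScorerArgs = { input: string; output: string };
type ScoreResult = { score: number };

// Reference-free format check: the output should be valid JSON with a
// string "category" field. No expected value is needed.
const validFormatScorer = {
  name: 'valid-format',
  score: ({ output }: ScorerArgs): ScoreResult => {
    try {
      const parsed = JSON.parse(output);
      return { score: typeof parsed.category === 'string' ? 1 : 0 };
    } catch {
      // Output wasn't valid JSON at all
      return { score: 0 };
    }
  },
};
```

Because the check is a cheap heuristic, a scorer like this can safely run on every request, unlike an LLM judge.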
Use sampling to control the percentage of production traffic each scorer evaluates. Wrap a scorer in `{ scorer, sampling }` to set its rate; scorers without a sampling wrapper run on every request. You can mix sampled and unsampled scorers in the same call, which is useful for expensive scorers like LLM judges: sample their traffic while cheap heuristic scorers run on everything.
| `sampling` value | Behavior |
| --- | --- |
| not set (default) | Evaluate every request |
| `0.5` | Evaluate ~50% of requests |
| `0.1` | Evaluate ~10% of requests |
| `0.0` | Never evaluate. The scorer is skipped and its key is omitted from the result record |
```typescript
void onlineEval('categorize-message', {
  capability: 'support-agent',
  step: 'categorize-message',
  input: userMessage,
  output: result,
  scorers: [
    // Wrap each scorer with its own sampling rate
    { scorer: validCategoryScorer, sampling: 0.1 }, // Evaluate 10% of traffic
    formatConfidenceScorer, // Evaluate every request
  ],
});
```
You can also set the sampling value to a synchronous or asynchronous function that receives `{ input, output }` and returns a boolean (or `Promise<boolean>`) for conditional sampling logic. This is useful when the sampling decision depends on an async lookup, such as a feature-flag service.
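A sketch of a function-based sampling predicate, assuming a hypothetical feature-flag lookup (`isJudgeEnabled` is an illustrative stub, not part of the API):

```typescript
// Stub for an async feature-flag lookup; replace with your flag service.
async function isJudgeEnabled(): Promise<boolean> {
  return true;
}

// Conditional sampling: always judge long outputs, otherwise defer
// to the feature flag. Receives the same { input, output } the scorer sees.
const sampleByLength = async ({
  output,
}: {
  input: string;
  output: string;
}): Promise<boolean> => {
  if (output.length > 500) return true;
  return isJudgeEnabled();
};
```

Such a predicate would then be passed in place of a numeric rate, e.g. `{ scorer: judgeScorer, sampling: sampleByLength }` (where `judgeScorer` is a placeholder for your own LLM-as-judge scorer).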