This page defines the core terms used in the AI engineering workflow. Understanding these concepts is the first step toward building robust and reliable generative AI capabilities.

AI engineering lifecycle

The concepts in AI engineering are best understood within the context of the development lifecycle. While AI capabilities can become highly sophisticated, they typically start simple and evolve through a disciplined, iterative process:
1. Prototype a capability

Development starts by defining a task and prototyping a capability with a prompt to solve it.

2. Evaluate with ground truth

The prototype is then tested against a collection of reference examples (so-called “ground truth”) to measure its quality and effectiveness using scorers. This process is known as an offline evaluation.

3. Observe in production

Once a capability meets quality benchmarks, it’s deployed. In production, scorers can be applied to live traffic (online evaluation) to monitor performance and cost in real time.

4. Iterate with new insights

Insights from production monitoring reveal edge cases and opportunities for improvement. These new examples are used to refine the capability, expand the ground truth collection, and begin the cycle anew.

AI engineering terms

Capability

A generative AI capability is a system that uses large language models to perform a specific task by transforming inputs into desired outputs. Capabilities exist on a spectrum of complexity, ranging from simple to sophisticated architectures:
  • Single-turn model interactions: A single prompt and response, such as classifying a support ticket’s intent or summarizing a document.
  • Workflows: Multi-step processes where each step’s output feeds into the next, such as research → analysis → report generation.
  • Single-agent: An agent that can reason and make decisions to accomplish a goal, such as a customer support agent that can search documentation, check order status, and draft responses.
  • Multi-agent: Multiple specialized agents collaborating to solve complex problems, such as software engineering through architectural planning, coding, testing, and review.
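
As a minimal sketch of the simplest end of this spectrum, a single-turn capability can be a single typed function that wraps a prompt around a model call. `call_llm` below is a hypothetical placeholder for whatever model client you use, not a real API:

```python
# Illustrative single-turn capability: classify a support ticket's intent.
# `call_llm` is a hypothetical stand-in for a real model client.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with a call to your model provider.")

def classify_ticket(ticket_text: str) -> str:
    """Transform an input (ticket text) into a desired output (an intent label)."""
    prompt = (
        "Classify the intent of this support ticket as one of: "
        "billing, bug_report, feature_request, other.\n\n"
        f"Ticket: {ticket_text}\n\nIntent:"
    )
    return call_llm(prompt).strip().lower()
```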

Collection

A collection is a curated set of reference records used for development, testing, and evaluation of a capability. Collections serve as the test cases for prompt engineering.

Collection record

Collection records are the individual input-output pairs within a collection. Each record consists of an input and its corresponding expected output (ground truth).

Ground truth

Ground truth is the validated, expert-approved correct output for a given input. It represents the gold standard that the AI capability should aspire to match.
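
To make the relationship between collections, records, and ground truth concrete, here is a hypothetical sketch in plain Python; the field names and example records are illustrative, not a required schema:

```python
from dataclasses import dataclass

@dataclass
class CollectionRecord:
    input: str             # the input the capability receives
    expected_output: str   # ground truth: the expert-approved correct output

# A collection is a curated set of records that serve as test cases.
ticket_intent_collection: list[CollectionRecord] = [
    CollectionRecord(
        input="I was charged twice for my subscription this month.",
        expected_output="billing",
    ),
    CollectionRecord(
        input="The export button crashes the app on Safari.",
        expected_output="bug_report",
    ),
]
```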

Scorer

A scorer is a function that evaluates a capability’s output, returning a score that indicates quality or correctness. There are two types of scorers: reference-based scorers and reference-free scorers.

Reference-based scorer

A reference-based scorer depends on an expected value (ground truth) to evaluate a capability’s output. It compares the generated output against domain expert knowledge to determine correctness or similarity. Examples include:
  • Exact match scorer: Checks if the output exactly matches the expected value.
  • Similarity scorer: Measures how similar the output is to the expected value when an exact match isn’t required.
Reference-based scorers require domain expert knowledge and can be used in offline evaluation and backtesting (if historical traces have been reviewed by domain experts). They can’t be used in online evaluation because live production data lacks expected values.
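
As an illustrative sketch rather than a prescribed API, both scorers above can be written as functions that take the generated output together with the expected value; the similarity scorer here uses Python’s standard-library difflib as a stand-in for whatever similarity measure you prefer:

```python
import difflib

def exact_match_scorer(output: str, expected: str) -> float:
    """Reference-based: 1.0 only when the output exactly matches the ground truth."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def similarity_scorer(output: str, expected: str) -> float:
    """Reference-based: graded similarity in [0, 1] when an exact match isn't required."""
    return difflib.SequenceMatcher(None, output, expected).ratio()
```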

Reference-free scorer

A reference-free scorer evaluates a capability’s output without needing an expected value. It uses an LLM or other criteria to assess output quality based on general standards rather than comparison to ground truth. Examples include:
  • Toxicity scorer: Uses an LLM to assess whether the output is offensive, harmful, or could upset the recipient.
  • Coherence scorer: Evaluates whether the output is logically consistent and well-structured.
  • Relevance scorer: Assesses whether the output appropriately addresses the input query.
Reference-free scorers can be used in all evaluation contexts: offline evaluation, online evaluation, and backtesting. They’re the only type of scorer available for online evaluation because live production data hasn’t been reviewed by domain experts.
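
The defining property is the signature: a reference-free scorer sees only the input and output, never an expected value. The sketch below uses a deliberately crude word-overlap heuristic as a stand-in for a relevance scorer; in practice such scorers would typically be LLM-based, as described above.

```python
def relevance_scorer(input_text: str, output_text: str) -> float:
    """Reference-free: estimate how much the output addresses the input query.

    No expected value is required, which is why scorers with this shape can
    run against live production traffic. The word-overlap heuristic here is a
    toy placeholder for an LLM-judge implementation.
    """
    input_words = set(input_text.lower().split())
    output_words = set(output_text.lower().split())
    if not input_words:
        return 0.0
    return len(input_words & output_words) / len(input_words)
```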

Evaluation or “eval”

An evaluation, or eval, is the process of testing a capability’s performance using one or more scorers. There are three ways to run evaluations: offline evaluation, online evaluation, and backtesting.

Offline evaluation

Offline evaluation is the process of testing a capability against a curated set of test cases (collection records) before deployment. You curate what good looks like with domain expert knowledge and provide expected values for your test cases. Offline evaluation can use both reference-based scorers (which compare output to expected values) and reference-free scorers (which assess output quality without needing expected values).
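
Conceptually, an offline evaluation is a loop: run the capability on each collection record, apply scorers, and aggregate. A minimal self-contained sketch, with a placeholder capability and scorer standing in for real ones:

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

@dataclass
class CollectionRecord:
    input: str
    expected_output: str  # ground truth

def offline_eval(
    capability: Callable[[str], str],
    collection: list[CollectionRecord],
    scorer: Callable[[str, str], float],
) -> float:
    """Run the capability over every record and score outputs against ground truth."""
    scores = [scorer(capability(r.input), r.expected_output) for r in collection]
    return mean(scores)

# Usage with placeholder pieces (not a real model or production scorer).
def always_billing(text: str) -> str:
    return "billing"

def exact_match(output: str, expected: str) -> float:
    return float(output == expected)

collection = [
    CollectionRecord("I was charged twice this month.", "billing"),
    CollectionRecord("The export button crashes on Safari.", "bug_report"),
]
print(offline_eval(always_billing, collection, exact_match))  # 0.5
```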

Online evaluation

Online evaluation is the process of applying scorers to a capability’s live production traffic in real time, just after conversations or interactions finish. Online evaluation runs on a sample of traces to provide feedback on performance degradation, cost, and quality drift. Because online evaluation happens on live data without domain expert review, it can only use reference-free scorers. Reference-based scorers require expected values that don’t exist for live production traffic.
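
A minimal sketch of the sampling idea, with an assumed trace shape (a dict carrying the capability’s input and output) and an illustrative sample rate:

```python
import random
from typing import Callable

SAMPLE_RATE = 0.1  # score roughly 10% of finished traces (illustrative value)

def maybe_score_trace(
    trace: dict,
    reference_free_scorers: list[Callable[[str, str], float]],
) -> dict | None:
    """Apply reference-free scorers to a sampled live trace.

    Live traces carry no expected value, so reference-based scorers cannot run here.
    """
    if random.random() > SAMPLE_RATE:
        return None  # trace not sampled for scoring
    return {
        scorer.__name__: scorer(trace["input"], trace["output"])
        for scorer in reference_free_scorers
    }
```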

Backtesting

Backtesting is a form of evaluation that runs a capability against historical production traces from a specified time period. Unlike online evaluation, which runs in real time as conversations happen, backtesting performs batch evaluation over many stored conversations at once. This lets you compare how a new version of your capability performs on previous real-world conversations against the original version. Backtesting can always use reference-free scorers. It can also use reference-based scorers if domain experts have reviewed the historical traces and provided ground truth expected values.

Flag

A flag is a configuration parameter that controls how your AI capability behaves. Flags let you parameterize aspects like model choice, tool availability, prompting strategies, or retrieval approaches. By defining flags, you can run experiments to compare different configurations and systematically determine which approach performs best.
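
For illustration only, flags can be modeled as a small typed configuration object that the capability consults instead of hard-coding choices; the specific fields below are hypothetical examples, not required parameters:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Flags:
    model: str = "small-model"   # which model the capability calls
    temperature: float = 0.0     # sampling temperature
    use_retrieval: bool = False  # whether to add retrieved context to the prompt

def build_prompt(question: str, flags: Flags) -> str:
    """The capability reads its flags rather than hard-coding these decisions."""
    context = "Context: <retrieved documents would go here>\n" if flags.use_retrieval else ""
    return f"{context}Answer the question: {question}"
```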

Experiment

An experiment is an evaluation run with a specific set of flag values. By running multiple experiments with different flag configurations, you can compare performance across different models, prompts, or strategies to find the optimal setup for your capability.
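
Sketching the relationship: each experiment pins one set of flag values and produces an aggregate score, and comparing experiments means comparing those scores. The evaluation call below is a stub that stands in for a real offline evaluation run:

```python
def run_offline_eval(flags: dict) -> float:
    """Placeholder: a real implementation would evaluate the capability,
    configured with these flags, over a collection and aggregate scorer results."""
    return 0.0  # dummy value so the comparison below is runnable

experiments = {
    "baseline": {"model": "small-model", "use_retrieval": False},
    "with-retrieval": {"model": "small-model", "use_retrieval": True},
}

results = {name: run_offline_eval(flags) for name, flags in experiments.items()}
best = max(results, key=results.get)  # with real scores, this picks the winning setup
print(results, "->", best)
```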

Reviews

Reviews are expert-provided observations, labels, or corrections added to production traces or evaluation results. Domain experts review AI capability runs and document what went wrong, what should have happened differently, or categorize failure modes. These reviews help identify patterns in capability failures, validate scorer accuracy, and create new test cases for collections.

User feedback

User feedback is direct signal from end users about AI capability performance, typically collected through ratings (thumbs up/down, stars) or text comments. Feedback events are associated with traces to provide context about both system behavior and user perception. Aggregated feedback reveals quality trends, helps prioritize improvements, and surfaces issues that might not appear in evaluations.

What’s next?

Now that you understand the core concepts, get started with the Quickstart or dive into Evaluate to learn about systematic testing.