AI engineering lifecycle
The concepts in AI engineering are best understood within the context of the development lifecycle. While AI capabilities can become highly sophisticated, they typically start simple and evolve through a disciplined, iterative process:
Prototype a capability
Development starts by defining a task and prototyping a capability with a prompt to solve it.
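As a rough sketch, the first prototype can be as small as a prompt template and a single model call. The `call_llm` helper below is a hypothetical stand-in for whichever model client you actually use, and the task and prompt wording are illustrative.

```python
# Hypothetical prompt prototype for a support-ticket intent classification task.
def call_llm(prompt: str) -> str:
    """Placeholder for your model client of choice (not a real API)."""
    raise NotImplementedError("Wire this to your LLM provider's SDK.")

def classify_intent(ticket_text: str) -> str:
    prompt = (
        "Classify the intent of this support ticket as one of: "
        "billing, bug_report, feature_request, other.\n\n"
        f"Ticket: {ticket_text}\n\nIntent:"
    )
    return call_llm(prompt).strip().lower()
```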
Evaluate with ground truth
The prototype is then tested against a collection of reference examples (so-called “ground truth”) to measure its quality and effectiveness using scorers. This process is known as an evaluation.
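Concretely, an evaluation runs the capability over every record in a collection and applies a scorer to each output against its ground truth. The loop below is a minimal sketch assuming exact match as the scorer; the record fields and names are illustrative.

```python
from typing import Callable

# Hypothetical evaluation loop: run a capability over a collection and score it.
def evaluate(capability: Callable[[str], str], collection: list[dict]) -> float:
    def exact_match(output: str, expected: str) -> float:
        return 1.0 if output == expected else 0.0

    scores = [exact_match(capability(r["input"]), r["expected"]) for r in collection]
    return sum(scores) / len(scores)  # e.g. accuracy across the collection

# Usage sketch with the classify_intent prototype above:
# evaluate(classify_intent, [{"input": "I was charged twice", "expected": "billing"}])
```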
Observe in production
Once a capability meets quality benchmarks, it’s deployed. In production, scorers can be applied to live traffic to monitor performance and cost in real time.
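Live traffic usually has no ground truth, so production monitoring leans on reference-free scorers plus operational metrics such as latency. The pattern below is an assumed sketch; `emit_metric` and `reference_free_score` are hypothetical placeholders for your observability pipeline and scorer.

```python
import time
from typing import Callable

def reference_free_score(output: str) -> float:
    return 1.0  # placeholder: swap in a real reference-free scorer (see below)

def emit_metric(name: str, value: float) -> None:
    print(f"{name}={value:.3f}")  # placeholder: send to your metrics backend

def handle_request(capability: Callable[[str], str], user_input: str) -> str:
    start = time.perf_counter()
    output = capability(user_input)  # the deployed capability
    emit_metric("capability.latency_s", time.perf_counter() - start)
    emit_metric("capability.quality", reference_free_score(output))
    return output
```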
AI engineering terms
Capability
A generative AI capability is a system that uses large language models to perform a specific task by transforming inputs into desired outputs. Capabilities exist on a spectrum of complexity, ranging from simple to sophisticated architectures:
- Single-turn model interactions: A single prompt and response, such as classifying a support ticket’s intent or summarizing a document.
- Workflows: Multi-step processes where each step’s output feeds into the next, such as research → analysis → report generation (a minimal sketch follows this list).
- Single-agent: An agent that can reason and make decisions to accomplish a goal, such as a customer support agent that can search documentation, check order status, and draft responses.
- Multi-agent: Multiple specialized agents collaborating to solve complex problems, such as software engineering through architectural planning, coding, testing, and review.
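To make the workflow pattern concrete, here is a minimal two-step sketch in which a summarization step feeds an action-item extraction step. `call_llm` is the same hypothetical placeholder used in the prototype sketch above, and the prompts are illustrative.

```python
# Hypothetical two-step workflow: each step's output feeds the next.
def summarize(document: str) -> str:
    return call_llm(f"Summarize the following document in three sentences:\n\n{document}")

def extract_action_items(summary: str) -> str:
    return call_llm(f"List the action items implied by this summary:\n\n{summary}")

def report_workflow(document: str) -> str:
    summary = summarize(document)         # step 1
    return extract_action_items(summary)  # step 2 consumes step 1's output
```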
Collection
A collection is a curated set of reference records used for development, testing, and evaluation of a capability. Collections serve as the test cases for prompt engineering.
Collection record
Collection records are the individual input-output pairs within a collection. Each record consists of an input and its corresponding expected output (ground truth).
Ground truth
Ground truth is the validated, expert-approved correct output for a given input. It represents the gold standard that the AI capability should aspire to match.
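One plausible way to represent these terms in code is a small record type pairing an input with its ground-truth output, with a collection being a curated list of such records. The field names and sample data below are illustrative, not a required schema.

```python
from dataclasses import dataclass

@dataclass
class CollectionRecord:
    input: str            # what the capability receives
    expected_output: str  # ground truth: the expert-approved correct answer

# A collection: a curated set of records used as test cases (illustrative data).
intent_collection: list[CollectionRecord] = [
    CollectionRecord("I was charged twice this month", "billing"),
    CollectionRecord("Please add dark mode", "feature_request"),
]
```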
Scorer
A scorer is a function that evaluates a capability’s output, returning a score that indicates quality or correctness. There are two types of scorers: reference-based scorers and reference-free scorers.
Reference-based scorer
A reference-based scorer depends on an expected value (ground truth) to evaluate a capability’s output. It compares the generated output against domain expert knowledge to determine correctness or similarity. Examples include (see the sketch after this list):
- Exact match scorer: Checks if the output exactly matches the expected value.
- Similarity scorer: Measures how similar the output is to the expected value when an exact match isn’t required.
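As a sketch, both reference-based scorers above fit in a few lines each; the similarity scorer uses Python’s standard-library `difflib` purely as a stand-in for whatever similarity measure (embedding distance, edit distance, etc.) you prefer.

```python
from difflib import SequenceMatcher

def exact_match_scorer(output: str, expected: str) -> float:
    """1.0 if the output matches the ground truth exactly, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def similarity_scorer(output: str, expected: str) -> float:
    """Similarity in [0, 1] for cases where an exact match isn't required."""
    return SequenceMatcher(None, output, expected).ratio()
```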
Reference-free scorer
A reference-free scorer evaluates a capability’s output without needing an expected value. It uses an LLM or other criteria to assess output quality based on general standards rather than comparison to ground truth. Examples include (see the sketch after this list):
- Toxicity scorer: Uses an LLM to assess whether the output is offensive, harmful, or could upset the recipient.
- Coherence scorer: Evaluates whether the output is logically consistent and well-structured.
- Relevance scorer: Assesses whether the output appropriately addresses the input query.
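A common shape for reference-free scoring is LLM-as-judge: prompt a model to grade the output against a rubric, with no expected value involved. The sketch below reuses the hypothetical `call_llm` placeholder from earlier and shows one possible toxicity scorer, not a fixed API.

```python
# Hypothetical LLM-as-judge toxicity scorer (reference-free: no ground truth needed).
def toxicity_scorer(output: str) -> float:
    prompt = (
        "Rate the following text from 0 (harmless) to 1 (highly offensive or harmful). "
        "Reply with only the number.\n\n"
        f"Text: {output}"
    )
    try:
        return float(call_llm(prompt).strip())
    except ValueError:
        return 0.0  # fall back if the judge doesn't return a parseable number
```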