This page defines the core terms used in the Rudder workflow. Understanding these concepts is the first step toward building robust and reliable generative AI capabilities.

Rudder lifecycle

The concepts in Rudder are best understood within the context of the development lifecycle. While AI capabilities can become highly sophisticated, they typically start simple and evolve through a disciplined, iterative process:

1. Prototype a capability

Development starts by defining a task and prototyping a capability with a prompt to solve it.

2. Evaluate with ground truth

The prototype is then tested against a collection of reference examples (so-called “ground truth”) to measure its quality and effectiveness using graders. This process is known as an eval.

3. Observe in production

Once a capability meets quality benchmarks, it’s deployed. In production, graders can be applied to live traffic (an online eval) to monitor performance and cost in real time.

4. Iterate with new insights

Insights from production monitoring reveal edge cases and opportunities for improvement. These new examples are used to refine the capability, expand the ground truth collection, and begin the cycle anew.

Rudder terms

Capability

A generative AI capability is a system that uses large language models to perform a specific task by transforming inputs into desired outputs.

Capabilities exist on a spectrum of complexity. They can be a simple, single-step function (for example, classifying a support ticket’s intent) or evolve into a sophisticated, multi-step agent that uses reasoning and tools to achieve a goal (for example, orchestrating a complete customer support resolution).
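As an illustration, a single-step capability can be modeled as a plain function wrapping a model call. The sketch below is hypothetical: `classify_intent` and the `CompleteFn` client type are illustrative names, not part of any Rudder API.

```python
from typing import Callable

# Hypothetical client type: takes a prompt, returns the model's text completion.
# In practice this would wrap whichever LLM SDK you use.
CompleteFn = Callable[[str], str]

INTENTS = ["billing", "bug_report", "feature_request", "other"]

def classify_intent(ticket_text: str, complete: CompleteFn) -> str:
    """A single-step capability: map a support ticket to one intent label."""
    prompt = (
        "Classify this support ticket into exactly one of: "
        f"{', '.join(INTENTS)}.\n\nTicket: {ticket_text}\n\nIntent:"
    )
    label = complete(prompt).strip().lower()
    # Guard against unexpected model output.
    return label if label in INTENTS else "other"
```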

Collection

A collection is a curated set of reference records used for development, testing, and evaluation of a capability. Collections serve as the test cases for prompt engineering.

Record

Records are the individual input-output pairs within a collection. Each record consists of an input and its corresponding expected output (ground truth).
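One way to picture these two terms in code, assuming simple dataclasses rather than any specific Rudder schema:

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    """One reference example: an input and its expected (ground truth) output."""
    input: str
    expected_output: str

@dataclass
class Collection:
    """A curated set of records used to develop and evaluate a capability."""
    name: str
    records: list[Record] = field(default_factory=list)

tickets = Collection(
    name="support-ticket-intents",
    records=[
        Record(input="I was charged twice this month.", expected_output="billing"),
        Record(input="The export button crashes the app.", expected_output="bug_report"),
    ],
)
```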

Reference

A reference is a historical example of a task completed successfully, serving as a benchmark for AI performance. References provide the input-output pairs that demonstrate the expected behavior and quality standards.

Ground Truth

Ground truth is the validated, expert-approved correct output for a given input. It represents the gold standard that the AI capability should match.

Annotation

Annotations are expert-provided labels, corrections, or outputs added to records to establish or refine ground truth.
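Continuing the hypothetical `Record` dataclass above, applying an annotation to establish ground truth might look like this:

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """An expert-provided correction attached to a record."""
    annotator: str
    corrected_output: str

def apply_annotation(record: Record, annotation: Annotation) -> Record:
    """Promote an expert's correction to the record's ground truth."""
    record.expected_output = annotation.corrected_output
    return record
```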

Grader

A grader is a function that scores a capability’s output. It programmatically assesses quality by comparing the generated output against ground truth or other criteria, returning a score or judgment. Graders are the reusable, atomic scoring logic used in all forms of evaluation.
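A grader can be as simple as an exact-match check. The `Grader` signature below is illustrative, not Rudder's actual interface:

```python
from typing import Callable

# A grader scores (generated_output, expected_output) on a 0.0-1.0 scale.
Grader = Callable[[str, str], float]

def exact_match(generated: str, expected: str) -> float:
    """Score 1.0 on an exact match with ground truth, ignoring case and whitespace."""
    return 1.0 if generated.strip().lower() == expected.strip().lower() else 0.0
```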

Evaluator (Eval)

An evaluator, or eval, is the process of testing a capability against a collection of ground truth data using one or more graders. An eval runs the capability on every record in the collection and reports metrics such as accuracy, pass rate, and cost. Evals are typically run before deployment to benchmark performance.
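In terms of the hypothetical types sketched above, an eval is essentially a loop over a collection:

```python
from typing import Callable

def run_eval(
    capability: Callable[[str], str],
    collection: Collection,
    grader: Grader,
) -> dict:
    """Run the capability on every record and aggregate grader scores."""
    scores = [
        grader(capability(record.input), record.expected_output)
        for record in collection.records
    ]
    total = len(scores)
    return {
        "records": total,
        "accuracy": sum(scores) / total if total else 0.0,
        "pass_rate": sum(s == 1.0 for s in scores) / total if total else 0.0,
    }
```

Given a concrete `complete` client, `run_eval(lambda t: classify_intent(t, complete), tickets, exact_match)` would benchmark the intent classifier against the collection from earlier.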

Online Eval

An online eval is the process of applying a grader to a capability’s live production traffic. This provides real-time feedback on performance degradation, cost, and quality drift, enabling continuous monitoring and improvement.
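A sketch of the idea, reusing the `INTENTS` list from earlier. The sampling rate and telemetry stand-in are assumptions; because live traffic has no expected output, the grader here scores against criteria rather than ground truth:

```python
import random

SAMPLE_RATE = 0.05  # grade roughly 5% of live requests to bound grading cost

def valid_intent(generated: str) -> float:
    """A reference-free grader: checks criteria instead of ground truth."""
    return 1.0 if generated.strip().lower() in INTENTS else 0.0

def observe(input_text: str, generated: str) -> None:
    """Grade a sampled slice of production traffic and emit the score."""
    if random.random() < SAMPLE_RATE:
        score = valid_intent(generated)
        print({"input": input_text, "score": score})  # stand-in for real telemetry
```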

What’s next?

Now that you understand the core concepts, see them in action in the Rudder workflow.