Evaluating agentic systems beyond the demo

Demos lie cheerfully. These are the evals we actually run before we ship.

A good demo proves a system can succeed once. A good eval proves it succeeds repeatedly, on inputs you did not choose, in conditions you did not control. These are very different claims, and teams ship on the first when they need the second.

Layer one: golden paths

Every agentic workflow gets a set of golden-path scenarios, the cases the product is supposed to handle. We instrument each one end-to-end, record the full trace, and diff new runs against the baseline. If a golden path regresses, the deploy is blocked.
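A minimal sketch of what that harness looks like, assuming a `run_agent` entry point, a fixed scenario list, and JSON baselines on disk (all illustrative names, not our actual layout):

```python
import json
import sys
from pathlib import Path

# Hypothetical golden-path scenario names and baseline location.
SCENARIOS = ["refund_request", "order_lookup", "address_change"]
BASELINE_DIR = Path("evals/baselines")


def run_agent(scenario: str) -> dict:
    """Placeholder: run the agentic workflow end-to-end and return its trace."""
    raise NotImplementedError


def normalize(trace: dict) -> dict:
    """Drop run-specific noise (timestamps, ids) so diffs only flag real changes."""
    return {k: v for k, v in trace.items() if k not in {"timestamp", "run_id"}}


def main() -> int:
    regressions = []
    for name in SCENARIOS:
        trace = normalize(run_agent(name))
        baseline = normalize(json.loads((BASELINE_DIR / f"{name}.json").read_text()))
        if trace != baseline:
            regressions.append(name)
    if regressions:
        print(f"golden-path regressions: {regressions}", file=sys.stderr)
        return 1  # nonzero exit fails the CI job, blocking the deploy
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

The shape is the point: every scenario runs on every candidate, the diff is against a stored baseline rather than a vibe check, and a regression returns a nonzero exit code so CI, not a person, blocks the deploy.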

Layer two: adversarial inputs

Golden paths catch regressions. They do not catch novel failure modes. For that we keep a growing adversarial set: inputs that have broken the agent in the past, inputs that plausibly could, inputs crafted to exploit the ways the model thinks. This set only ever grows.
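A sketch of the append-only mechanics, assuming the set lives in a JSONL file and that `run_agent` and a per-case `passes` check exist elsewhere (both hypothetical names):

```python
import json
from pathlib import Path

ADVERSARIAL_FILE = Path("evals/adversarial.jsonl")  # illustrative location


def add_case(prompt: str, reason: str) -> None:
    """Append a newly discovered breaking input. Cases are never deleted."""
    with ADVERSARIAL_FILE.open("a") as f:
        f.write(json.dumps({"prompt": prompt, "reason": reason}) + "\n")


def run_suite(run_agent, passes) -> list[dict]:
    """Run every adversarial case and return the ones the agent still fails."""
    still_failing = []
    for line in ADVERSARIAL_FILE.read_text().splitlines():
        case = json.loads(line)
        if not passes(case, run_agent(case["prompt"])):
            still_failing.append(case)
    return still_failing
```

There is deliberately no delete function. An input that broke the agent once stays in the suite even after it is fixed, because fixes regress.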



Layer three: LLM-as-judge, carefully

We use model-graded evals for rubric judgments (tone, format, completeness), but never alone. Every judge output is spot-checked by a human until we trust the rubric. Judges drift. Humans catch the drift.
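A sketch of how the spot-checking hooks in, assuming a `call_judge_model` wrapper around whatever model API you use; the rubric text and sampling rate here are illustrative:

```python
import json
import random
from pathlib import Path

RUBRIC = (
    "Score the response from 1 to 5 on each axis: tone, format, completeness. "
    'Reply as JSON, e.g. {"tone": 4, "format": 5, "completeness": 3}.'
)
SPOT_CHECK_RATE = 0.1                            # fraction of verdicts a human re-grades
REVIEW_QUEUE = Path("evals/judge_review.jsonl")  # sampled verdicts awaiting human review


def judge(response: str, call_judge_model) -> dict:
    """Grade a response against the rubric, sampling verdicts into a human
    review queue so drift in the judge gets noticed."""
    verdict = call_judge_model(f"{RUBRIC}\n\nResponse:\n{response}")
    scores = json.loads(verdict)
    if random.random() < SPOT_CHECK_RATE:
        with REVIEW_QUEUE.open("a") as f:
            f.write(json.dumps({"response": response, "judge": scores}) + "\n")
    return scores
```

The sampling rate starts high and only comes down as human and judge agreement holds up; it never goes to zero.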

Layer four: production replay

The last and most important layer: we capture real production traces, strip sensitive data, and replay them against candidate versions. The only eval set that matters long-term is the one your users write for you.
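A sketch of the capture-and-replay loop, assuming traces are stored as JSONL; the scrubbing regex and the exact-match comparison are stand-ins for whatever your data and product actually require:

```python
import json
import re
from pathlib import Path

TRACES = Path("evals/production_traces.jsonl")  # illustrative location
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")


def scrub(text: str) -> str:
    """Strip obvious sensitive data; real scrubbing is broader than one regex."""
    return EMAIL.sub("<email>", text)


def capture(raw_input: str, raw_output: str) -> None:
    """Scrub a production trace at capture time and append it to the replay set."""
    record = {"input": scrub(raw_input), "output": scrub(raw_output)}
    with TRACES.open("a") as f:
        f.write(json.dumps(record) + "\n")


def replay(candidate_agent) -> list[dict]:
    """Run the candidate on every stored input and flag divergences from what shipped."""
    divergences = []
    for line in TRACES.read_text().splitlines():
        trace = json.loads(line)
        new_output = candidate_agent(trace["input"])
        if new_output != trace["output"]:
            divergences.append({"input": trace["input"],
                                "shipped": trace["output"],
                                "candidate": new_output})
    return divergences
```

Exact equality is the crudest possible comparison; in practice divergences get scored, not just counted, but the discipline of replaying real traffic is what matters.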