Evaluating agentic systems beyond the demo
Demos lie cheerfully. These are the evals we actually run before we ship.
Everyone ships prompts. The product is what happens when the prompt fails.
The prompt is the part of an AI product people talk about. It is also the smallest part that matters. The product is what the system does when the prompt’s output is wrong, malformed, dangerous, or simply boring.
Every AI feature we have shipped spends more engineering effort on the failure path than on the happy path. That is not a complaint. It is the shape of the work.
Hallucinated facts. Malformed JSON. Refusals where refusal is unhelpful. Confident answers to questions the user did not ask. Each of these needs a specific response: retry with a different model, fall back to a rules-based path, surface the uncertainty, ask a clarifying question, escalate to a human.
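The mapping from failure mode to specific response can be sketched as a small dispatch table. This is a minimal illustration, not our production code; the classifier heuristics and the handler names (`classify`, `dispatch`, the refusal prefixes) are all assumptions made up for the example.

```python
import json
from enum import Enum, auto

class Failure(Enum):
    MALFORMED_JSON = auto()
    REFUSAL = auto()
    OK = auto()

# Hypothetical classifier: the heuristics here are illustrative stand-ins
# for whatever detection a real system would use.
def classify(raw: str) -> Failure:
    try:
        json.loads(raw)  # structural check: is the output parseable at all?
    except json.JSONDecodeError:
        return Failure.MALFORMED_JSON
    return Failure.OK

def classify_text(raw: str) -> Failure:
    # For free-text outputs: a crude refusal check, purely as a sketch.
    if raw.lstrip().lower().startswith(("i can't", "i cannot", "i'm sorry")):
        return Failure.REFUSAL
    return Failure.OK

# Each failure mode gets its own specific response, as described above.
def dispatch(failure: Failure) -> str:
    return {
        Failure.MALFORMED_JSON: "retry_with_different_model",
        Failure.REFUSAL: "fall_back_to_rules_based_path",
        Failure.OK: "deliver_to_user",
    }[failure]
```

The point of the table is that "the prompt failed" is never one condition; each branch has a distinct remedy, and each remedy needs its own tests.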
A production agentic system is a set of guardrails, validators, fallbacks, and observability hooks arranged around a core model call. The prompt is the heart. The rest is the cardiovascular system. You cannot ship with just the heart.
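The arrangement above can be made concrete with a wrapper: validators and an observability hook around one core model call, with a retry loop and a rules-based fallback when the model path is exhausted. A minimal sketch, assuming hypothetical `call_model` and `rules_fallback` callables; nothing here is a real library API.

```python
import json
import logging
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")

def guarded_call(call_model: Callable[[str], str],
                 rules_fallback: Callable[[str], str],
                 prompt: str,
                 max_retries: int = 2) -> dict:
    """Guardrails, validators, a fallback, and an observability hook
    arranged around a core model call (the 'heart')."""
    for attempt in range(max_retries):
        raw = call_model(prompt)
        # Observability hook: record every attempt, success or not.
        log.info("attempt=%d output_chars=%d", attempt, len(raw))
        try:
            parsed = json.loads(raw)          # validator: well-formed JSON
        except json.JSONDecodeError:
            continue                          # guardrail: retry the core call
        if "answer" in parsed:                # validator: required field present
            return {"source": "model", "answer": parsed["answer"]}
    # Fallback: the rules-based path when every model attempt fails validation.
    return {"source": "rules", "answer": rules_fallback(prompt)}
```

Usage follows the same shape in a test harness: feed it a model stub that returns garbage and assert the rules path fires; feed it valid JSON and assert the model path wins.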