The prompt is not the product
Everyone ships prompts. The product is what happens when the prompt fails.
Demos lie cheerfully. These are the evals we actually run before we ship.
A good demo proves a system can succeed once. A good eval proves it succeeds repeatedly, on inputs you did not choose, in conditions you did not control. These are very different claims, and teams ship on the first when they need the second.
Every agentic workflow gets a set of golden-path scenarios, the cases the product is supposed to handle. We instrument each one end-to-end, record the full trace, and diff new runs against the baseline. If a golden path regresses, the deploy is blocked.
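A minimal sketch of that deploy gate, assuming a trace is a dict of ordered steps with `tool` and `output` fields (an illustrative schema, not any particular trace format):

```python
# Hypothetical golden-path gate: diff a candidate run's trace against a
# recorded baseline and block the deploy on any regression. The "steps",
# "tool", and "output" keys are an illustrative schema.
def diff_trace(baseline: dict, candidate: dict) -> list[str]:
    """Return a list of regressions between two end-to-end traces."""
    regressions = []
    base_steps = baseline["steps"]
    cand_steps = candidate["steps"]
    if len(base_steps) != len(cand_steps):
        regressions.append(
            f"step count changed: {len(base_steps)} -> {len(cand_steps)}"
        )
    for i, (b, c) in enumerate(zip(base_steps, cand_steps)):
        if b["tool"] != c["tool"]:
            regressions.append(f"step {i}: tool {b['tool']} -> {c['tool']}")
        elif b["output"] != c["output"]:
            regressions.append(f"step {i}: output drifted")
    return regressions

def gate_deploy(golden_paths: list[tuple[dict, dict]]) -> bool:
    """True only if every (baseline, candidate) pair shows no regression."""
    return all(not diff_trace(base, cand) for base, cand in golden_paths)
```

The point of diffing the full trace rather than the final answer is that an agent can reach the right output by a newly fragile route; the gate catches the route change too.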
Golden paths catch regressions. They do not catch novel failure modes. For that we keep a growing adversarial set: inputs that have broken the agent in the past, inputs that plausibly could, inputs crafted to exploit the ways the model thinks. This set only ever grows.
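The append-only discipline can be sketched as a small suite that persists every case and replays all of them; `AdversarialSet` and its JSON-on-disk layout are hypothetical:

```python
import json
from pathlib import Path

# Hypothetical append-only adversarial suite. Cases can be added,
# never deleted -- the set only ever grows.
class AdversarialSet:
    def __init__(self, path: Path):
        self.path = path
        self.cases = json.loads(path.read_text()) if path.exists() else []

    def add(self, input_text: str, reason: str) -> None:
        """Record an input that broke (or plausibly could break) the agent."""
        self.cases.append({"input": input_text, "reason": reason})
        self.path.write_text(json.dumps(self.cases, indent=2))

    def run(self, agent) -> list[dict]:
        """Replay every case; return the ones the agent still fails."""
        return [c for c in self.cases if not agent(c["input"])]
```

Because the file is append-only, a fix that quietly re-breaks an old exploit shows up as a failure on replay rather than disappearing with a pruned test.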
We use model-graded evals for rubric judgments (tone, format, completeness), but never alone. Every judge output is spot-checked by a human until we trust the rubric. Judges drift. Humans catch the drift.
The last and most important layer: we capture real production traces, strip sensitive data, and replay them against candidate versions. The only eval set that matters long-term is the one your users write for you.
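A sketch of the scrub-and-replay loop; the regexes and trace fields are placeholders, and real scrubbing needs a proper PII pipeline rather than two patterns:

```python
import re

# Hypothetical scrubbers for captured production traces. Illustrative
# patterns only -- real redaction covers far more than emails and phones.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def scrub(text: str) -> str:
    """Replace sensitive substrings with typed placeholders."""
    text = EMAIL.sub("<EMAIL>", text)
    return PHONE.sub("<PHONE>", text)

def replay(traces: list[dict], candidate) -> float:
    """Run scrubbed user inputs through a candidate; return its pass rate."""
    passed = sum(candidate(scrub(t["input"])) for t in traces)
    return passed / len(traces)
```

Replaying scrubbed traces gives the candidate the distribution users actually produce, which no hand-written eval set matches for long.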