A research team ran GPT-4o on the same retail agent tasks eight times. On any single attempt, the model succeeded about two-thirds of the time. Measured across all eight attempts, the probability of succeeding every time dropped below 25%. Same model, same tasks, same tools, same everything. The only variable was the sampling stochasticity inherent in how language models generate outputs.
That gap, from roughly two-thirds on a single attempt to under 25% across all eight, describes a property of sequential non-deterministic systems that the agent ecosystem has no standard way to measure and no shared vocabulary for discussing, and that its evaluation infrastructure is structurally unable to see.
Reliability in these systems behaves like a consumable resource. Every step where a model makes a decision, chooses a tool, or interprets an ambiguous result spends from a finite budget. The math is just multiplication: if each step succeeds independently with probability p, a chain of n steps succeeds with probability p^n. A system with 95% reliability at each model-directed step sounds robust. But watch what happens as steps accumulate:
| Sequential model-directed steps | Cumulative reliability |
|---|---|
| 1 | 95% |
| 5 | 77% |
| 10 | 60% |
| 20 | 36% |
The arithmetic is trivial. The constraint it implies is nearly invisible to the people operating under it.
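The table's figures can be reproduced in a few lines. This is a minimal sketch that assumes every step succeeds independently with the same probability; the function name `cumulative_reliability` is illustrative and does not come from any library.

```python
def cumulative_reliability(per_step: float, steps: int) -> float:
    """Probability that all `steps` sequential steps succeed.

    Assumes each step succeeds independently with probability `per_step`.
    """
    return per_step ** steps


# Reproduce the table rows for 95% per-step reliability.
for n in (1, 5, 10, 20):
    print(f"{n:>2} steps: {cumulative_reliability(0.95, n):.0%}")
```

Real workflows rarely have identical, independent steps, so treat the output as a rough ceiling on intuition and not as a prediction.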
Why the collapse looks sudden
A February 2026 study traced what happens inside agent runs where the same model succeeds sometimes and fails other times on identical tasks. The researchers found a drift coefficient of +0.227: each step that deviates from the canonical solution path increases the probability of the next deviation by roughly 23 percentage points.
Spending accelerates spending.
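A toy model helps build intuition for that feedback loop. The sketch below is an assumption, not the study's method: it reads the coefficient as additive and cumulative, so each deviation already made raises the chance of the next one by 0.227, and it uses a hypothetical 5% base deviation rate. It computes the exact distribution over how many deviations a run has accumulated after a given number of steps.

```python
def deviation_count_distribution(
    base: float, steps: int, drift: float = 0.227
) -> list[float]:
    """Return dist where dist[k] = P(exactly k deviations after `steps` steps).

    Toy assumption: with k deviations so far, the next step deviates with
    probability min(1, base + drift * k).
    """
    dist = [1.0]  # before any step, zero deviations with certainty
    for _ in range(steps):
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            p_dev = min(1.0, base + drift * k)
            nxt[k] += mass * (1.0 - p_dev)   # this step stays on path
            nxt[k + 1] += mass * p_dev       # this step deviates
        dist = nxt
    return dist


def expected_deviations(dist: list[float]) -> float:
    """Mean number of deviations under the distribution."""
    return sum(k * p for k, p in enumerate(dist))


# Compare self-reinforcing drift against independent deviations (drift = 0).
with_drift = deviation_count_distribution(base=0.05, steps=20)
no_drift = deviation_count_distribution(base=0.05, steps=20, drift=0.0)
print(f"P(clean run):        {with_drift[0]:.2f} vs {no_drift[0]:.2f}")
print(f"expected deviations: {expected_deviations(with_drift):.2f} "
      f"vs {expected_deviations(no_drift):.2f}")
```

Under these assumptions the chance of a perfectly clean run is the same with or without drift, since drift only acts after a first deviation. What changes is how far a run travels once it has left the path.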
The finding that should unsettle anyone building these systems: the divergence between successful and failed runs is statistically indistinguishable through the first half of execution. At 10%, 25%, even 50% of the trajectory, the runs look identical. Drift accumulates beneath the surface, becoming measurable only after about 75% completion. By the time failure is visible, the budget has been exhausted for a while. The collapse has been building long before it shows up in any signal you can act on.
And this invisibility is structural. Better tooling won't resolve it. The sample complexity for attributing outcomes to early steps grows exponentially with the number of intervening steps. Measuring where the budget was spent is expensive for the same reason the budget matters: the signal connecting early decisions to final outcomes decays exponentially with depth. The constraint hides from the instruments you'd use to find it.
Architecture as instinct
If you look at enough production agent systems, a pattern emerges. The outer loop is always deterministic code. Model reasoning lives inside, bounded within stages where its unpredictability can't cascade outward. Coding agents delegate search and file editing to constrained workers. Research systems let sub-agents explore in parallel but synthesize in code. Every serious team converges on this shape.
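That shape can be sketched in a few lines. Everything here is hypothetical: `ModelFn` stands in for whatever LLM client a team uses, and the two stages and their validators are invented for illustration. The point is only that code fixes the order of stages, and each model call sits in a bounded slot whose output is validated before anything downstream sees it.

```python
from typing import Callable

# Hypothetical stand-in for an LLM call: prompt in, text out.
ModelFn = Callable[[str], str]


def run_stage(
    model: ModelFn,
    prompt: str,
    is_valid: Callable[[str], bool],
    max_attempts: int = 3,
) -> str:
    """Bounded model-directed stage: retry until the output validates."""
    for _ in range(max_attempts):
        output = model(prompt)
        if is_valid(output):
            return output
    # Fail loudly so bad output cannot cascade into later stages.
    raise RuntimeError(f"stage failed validation after {max_attempts} attempts")


def pipeline(model: ModelFn, ticket: str) -> dict[str, str]:
    """Deterministic outer loop: code decides what runs and in what order."""
    category = run_stage(
        model,
        f"Classify as 'refund' or 'exchange': {ticket}",
        lambda out: out in {"refund", "exchange"},
    )
    summary = run_stage(
        model,
        f"Summarize in one sentence: {ticket}",
        lambda out: 0 < len(out) <= 200,
    )
    return {"category": category, "summary": summary}
```

The validators do the containment work: a stage either returns something the next stage can rely on or stops the run.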
Whether they'd describe it this way or not, they're allocating a budget. Every decision about where to place a model-directed step is a decision about how much non-determinism the workflow can afford. Moving a step from model-directed to code-driven stops that step from consuming reliability budget entirely. Well-specified tool interfaces serve the same function from a different angle: constraining the model's decision space at each call stretches the budget further. One study found that semantic weighting of tool interfaces reduced cumulative distortion by 80%, and that periodic re-grounding approximately every nine steps was sufficient to keep error accumulation under control.
These are sophisticated engineering responses. But they're arrived at through instinct and hard-won experience. No team I'm aware of measures its non-determinism budget, tracks how much of it a given workflow consumes, or sets explicit thresholds for how many sequential model-directed steps a pipeline can tolerate. They're making allocation decisions about a resource they've never quantified.
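An explicit threshold of that kind is not hard to compute once a per-step reliability estimate exists. The sketch below assumes independent steps with equal reliability, which is a simplification, and the function name `max_sequential_steps` is made up for this example.

```python
def max_sequential_steps(per_step: float, target: float) -> int:
    """Largest n such that per_step ** n >= target.

    Assumes independent steps that each succeed with probability `per_step`.
    """
    if not 0.0 < per_step < 1.0:
        raise ValueError("per_step must be strictly between 0 and 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    steps = 0
    reliability = 1.0
    # Keep adding steps while the next one still leaves us at or above target.
    while reliability * per_step >= target:
        reliability *= per_step
        steps += 1
    return steps


# How many model-directed steps fit in an 80% end-to-end budget?
print(max_sequential_steps(per_step=0.95, target=0.80))
print(max_sequential_steps(per_step=0.99, target=0.80))
```

The hard part is the input, not the arithmetic: a trustworthy per-step number is exactly what most teams do not have.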
What the benchmarks can't see
The evaluation ecosystem reports single-attempt capability on short tasks. SWE-bench measures whether a code patch passes tests on one try. WebArena and GAIA report pass@1. None run the same task repeatedly to see if the agent succeeds every time. None measure how reliability decays as task duration increases.
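A repeated-trial metric is simple to state. As a sketch, suppose each task is run n times and succeeds c times; then C(c, k) / C(n, k) is the chance that k trials drawn from those n all succeeded, which gives an unbiased estimate of "succeeds k times in a row." The function names and the sample results below are hypothetical.

```python
from math import comb


def all_k_succeed(successes: int, trials: int, k: int) -> float:
    """Estimate P(k independent attempts all succeed) from observed trials."""
    if not 0 < k <= trials:
        raise ValueError("k must satisfy 0 < k <= trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    # comb(successes, k) is 0 when successes < k, so the estimate is 0.
    return comb(successes, k) / comb(trials, k)


def benchmark_consistency(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate across all tasks."""
    per_task = [all_k_succeed(sum(r), len(r), k) for r in results.values()]
    return sum(per_task) / len(per_task)


# Hypothetical results: three tasks, eight trials each.
results = {
    "task_a": [True] * 8,
    "task_b": [True] * 6 + [False] * 2,
    "task_c": [True] * 2 + [False] * 6,
}
print(f"single attempt:  {benchmark_consistency(results, k=1):.2f}")
print(f"all 8 attempts:  {benchmark_consistency(results, k=8):.2f}")
```

With k=1 this reduces to the ordinary single-attempt success rate, so the same data yields both the number benchmarks report today and the one they leave out.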
The first systematic attempt to characterize this decay, an April 2026 study spanning 396 tasks and over 23,000 episodes, found that capability and reliability rankings diverge substantially as task horizons lengthen. The most capable models exhibited the highest meltdown rates, up to 19%, because they pursued more ambitious multi-step strategies. At long horizons, the models that score best on capability benchmarks can be the least reliable. And trajectory-opaque evaluation misses 44% of safety violations that trajectory-aware grading catches, because the failures live in the sequence of decisions, not in the final output.
The benchmarks measure what's affordable to measure. Multi-trial evaluation at frontier scale costs thousands of dollars per run. The economics of measurement and the economics of the problem are the same economics.
What it means to navigate by feel
SRE teams solved something superficially similar with error budgets: the gap between your SLO and 100% is how much unreliability you can afford to spend. But an SRE error budget replenishes monthly and measures crisp events. An agent's non-determinism budget exhausts within a single run. Failure is behavioral and probabilistic. There is no equivalent of a 5xx error for "the model chose a suboptimal tool at step 14 and the trajectory never recovered."
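For comparison, the SRE version of the calculation fits in one line. This sketch assumes an availability SLO over a 30-day window; the function name is illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability: (1 - SLO) times the window length."""
    if not 0.0 <= slo <= 1.0:
        raise ValueError("slo must be between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


# A 99.9% SLO over 30 days leaves roughly 43 minutes to spend.
print(f"{error_budget_minutes(0.999):.1f} minutes")
```

Both inputs are things an SRE team already has: a stated target and a clock. The agent equivalent has neither a stated target nor a crisp event to count.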
No SRE team would ship a production system without measuring its error budget. Agent teams ship without one routinely. They make consequential architectural decisions about where to place model-directed steps, how many sequential LLM calls a workflow can tolerate, and where to insert deterministic anchors. They often make these decisions well.
But they make them without a number. And the constraint they're navigating is, by its mathematical nature, invisible until it's too late, and expensive to measure even in retrospect. That might be the state of the art for longer than anyone building these systems is comfortable admitting.
Things to follow up on...
- The meltdown onset paradox: The "Beyond pass@1" framework introduces a metric called the Meltdown Onset Point that detects behavioral collapse in frontier models, finding that the most capable models melt down most often because they pursue more ambitious multi-step strategies.
- Checkpointing versus durable execution: A sharp Diagrid blog post argues that what most agent frameworks call durability is actually just save points that developers must manually trigger and coordinate, a distinction that matters enormously when reliability is already a scarce resource.
- Reliability as a measurement surface: ReliabilityBench proposes evaluating agents across a three-dimensional reliability surface that captures consistency, robustness to perturbation, and fault tolerance under infrastructure failures simultaneously.
- Code-driven versus LLM-driven orchestration: A useful practitioner guide from Genta.dev walks through when model-directed versus deterministic orchestration belongs in a system, with concrete examples of how the boundary placement shapes production outcomes.

