A piece of arithmetic should be shaping more product roadmaps than it currently does.
If each step in a multi-step agent workflow succeeds 95% of the time, a 20-step workflow completes about 36% of the time. Every individual step looks reliable. Multiplication is just unforgiving.
| Steps in workflow | End-to-end success at 95%/step | End-to-end success at 97.5%/step |
|---|---|---|
| 5 | 77% | 88% |
| 10 | 60% | 78% |
| 20 | 36% | ~60% |
The curve is steep and it does not negotiate.
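The table is nothing more than repeated multiplication. A few lines make it easy to recompute for any step count or per-step reliability, assuming (as throughout this piece) that steps succeed independently:

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability a workflow completes, assuming each of `steps`
    independent steps succeeds with probability `per_step`."""
    return per_step ** steps

for steps in (5, 10, 20):
    p95 = end_to_end_success(0.95, steps)
    p975 = end_to_end_success(0.975, steps)
    print(f"{steps:>2} steps: {p95:.0%} at 95%/step, {p975:.0%} at 97.5%/step")
# 20 steps: 36% at 95%/step, 60% at 97.5%/step
```

Independence is the simplifying assumption here; in practice errors in one step often cascade into later ones, which makes real chains worse than this model, not better.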
And 95% is generous. The best models on SWE-bench Pro score around 23% on coding tasks requiring multi-step reasoning. And coding is a structured domain with well-defined success criteria. The live web, with shifting layouts and adversarial bot detection, is less forgiving still.
## What the math forces
You can see compound error rates reshaping production most clearly in what organizations actually ship.
A UC Berkeley study surveying 306 practitioners found that 47% of production agents execute fewer than five steps before requiring human intervention. Sixty-eight percent cap at ten. In the study's detailed case analyses, 80% use predefined static workflows rather than open-ended autonomy. The researchers describe this as deliberate engineering. Organizations are building to the math.
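The fewer-than-five-steps finding lines up with the arithmetic. Solving `per_step ** n >= threshold` for `n` gives the longest chain that still clears a usable end-to-end success rate. At 95% per step and an 80% threshold (the threshold is my illustrative assumption, not a figure from the study), the answer is four steps:

```python
import math

def max_viable_steps(per_step: float, threshold: float) -> int:
    """Longest chain whose end-to-end success (per_step ** n) still
    meets `threshold`, assuming independent steps."""
    # per_step ** n >= threshold  =>  n <= ln(threshold) / ln(per_step)
    return math.floor(math.log(threshold) / math.log(per_step))

print(max_viable_steps(0.95, 0.80))  # -> 4
print(max_viable_steps(0.99, 0.80))  # -> 22
```

Under these assumptions, getting chains meaningfully past five steps requires per-step reliability near 99%, which is exactly where the checkpointing behavior the study observed would kick in.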
Call it workflow compression: the distance between what teams design on a whiteboard and what they deploy. Ambitious multi-step architectures get shortened, checkpointed, and bounded until end-to-end success rates clear a usable threshold. Anthropic's own guidance on building agents makes the point directly:
"Optimizing single LLM calls with retrieval and in-context examples is usually enough."
The vendor building frontier models is telling you to use fewer steps.
## What each nine costs
The path from demo to production has been described as a "march of nines": getting from 90% to 99% to 99.9% reliability, where each additional nine requires roughly the same effort as the last.
One widely cited estimate puts it starkly: reaching 80% reliability takes 20% of the effort, but production demands 99% or better, and that last stretch can take 100x more work.
This creates a specific economic bind for longer workflows. The human checkpoints that make ten-step chains viable were supposed to be temporary scaffolding. Improving step reliability from 95% to 99% requires systems work, a different kind of engineering than improving from 80% to 95%. Better prompts won't close that gap. While that systems work proceeds, the checkpoints persist.
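To see the size of that gap, invert the compounding: a workflow of n steps that must hit a given end-to-end target needs per-step reliability of target^(1/n). A sketch, again assuming independent steps:

```python
def required_per_step(target: float, steps: int) -> float:
    """Per-step success rate needed so that `steps` independent steps
    reach `target` end-to-end: solve p ** steps = target for p."""
    return target ** (1 / steps)

for target in (0.90, 0.99):
    p = required_per_step(target, 20)
    print(f"{target:.0%} end-to-end over 20 steps needs {p:.2%} per step")
# 90% end-to-end over 20 steps needs 99.47% per step
# 99% end-to-end over 20 steps needs 99.95% per step
```

Even a modest 90% end-to-end target over 20 steps demands roughly 99.5% per step, and a production-grade 99% target demands three and a half nines at every single step.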
The Berkeley study found 74% of production agents depend primarily on human evaluation. Think about what that means for the business case. The automation was justified by replacing task labor. Yet every chain longer than five steps still needs a human reviewing intermediate outputs, approving handoffs, catching confident-but-wrong results. The labor has migrated. Someone still exercises judgment, only now about system behavior on top of domain knowledge. That kind of oversight tends to cost more. And the oversight required to make longer chains viable starts consuming the savings that justified automating them. The business case hollows out gradually, not in a single visible collapse.
## Single-step benchmarks, multi-step terrain
The industry measures progress in single-step capability. Benchmark scores improve. Reasoning gets sharper. Context windows expand.
All genuine advances. The multiplication problem remains untouched by any of them. Halving the failure rate at each step, from 95% to 97.5%, takes a 20-step workflow from 36% end-to-end success to about 60%. Better, genuinely. Still nowhere near production-viable. Still nothing you'd build a business process around.
How many decision points a workflow requires ends up mattering more than how smart the agent is at any given one of them.
Organizations deploying agents into production already understand this. They build short, checkpointed, tightly scoped workflows. The industry's investment thesis, its benchmarks, its product roadmaps, still largely optimizes for single-step performance and deploys into multi-step terrain. The math doesn't care about the roadmap. It's already determining which workflows survive contact with production and which quietly get shortened until they do.
## Things to follow up on...
- Amazon's evaluation rethink: Amazon's engineering blog describes how building thousands of internal agents since 2025 forced a fundamental shift in evaluation methodologies away from model performance toward whole-system success rates.
- LangChain's production survey: The State of Agent Engineering survey of 1,340 practitioners found that quality, not capability, is the top production barrier at 32%, with 94% of production agents already having observability in place.
- Graph-based execution replacing chains: One infrastructure analysis argues that graph-based execution has definitively replaced linear chains in production, partly as an architectural response to compound error rates in sequential workflows.
- Gartner's cancellation forecast: Over 40% of agentic AI projects are expected to be canceled or fail to reach production by 2027, a figure corroborated by S&P Global data showing organizations scrap 46% of AI proof-of-concepts before deployment.