In surgical handoffs, individual competence is rarely the problem. A study of perioperative teams found that surgeons held 83% of necessary preoperative information. Anesthesiologists held 87%. But only 27% of that knowledge was shared across all primary team members. Each person knew enough. The patient's context didn't survive the handoff.
Agent deployments are breaking in the same place.
A demo works. The same workflow, handed to a different team and pointed at production traffic, breaks within a week. The postmortem lands on a familiar conclusion: the model wasn't capable enough. Upgrade the model, rewrite the prompts, add examples. Try again. Roughly 10% of enterprises successfully scale AI agents beyond pilot. When the other 90% investigate, they reach for capability as the explanation because capability is a legible problem. It has vendors, benchmarks, and a procurement path.
The RAND Corporation's 2024 study of AI implementation failures, drawn from 65 practitioner interviews, identified five root causes. Model capability appeared nowhere on the list. The root causes were environmental: misunderstood problem definition, inadequate data, insufficient deployment infrastructure.
So what actually happened between demo and production? The person who built the demo left the room, and their knowledge left with them. The timing quirks of a particular login flow. The dismissal pattern on a consent modal that behaves differently on Tuesdays. The accumulated "when you see this, do that" understanding built through weeks of iteration. None of it was written down, because none of it looked like knowledge at the time. It looked like building.
The diagnostic tools most teams rely on make this invisible. A survey of 1,340 organizations found that while observability tooling was widely adopted, only 52% ran evaluations to confirm outputs were actually correct. Without that second layer, a context gap and a capability gap produce identical symptoms: wrong output. And at 85% accuracy per step, a ten-step workflow compounds down to a roughly 20% end-to-end success rate. Small context gaps don't stay small. They cascade, and the tools that would distinguish "doesn't know enough" from "wasn't told enough" simply aren't in place.
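To see how fast that compounding bites, here is a back-of-envelope sketch. The 85% and ten-step figures are the ones from the paragraph above; treating steps as independent is a simplifying assumption.

```python
# End-to-end success of a multi-step agent workflow when each step
# independently succeeds with probability p_step. Independence is a
# simplifying assumption; correlated failures change the exact numbers,
# not the shape of the curve.

def end_to_end_success(p_step: float, n_steps: int) -> float:
    return p_step ** n_steps

for p in (0.85, 0.95, 0.99):
    print(f"per-step {p:.0%} -> 10-step workflow {end_to_end_success(p, 10):.1%}")

# per-step 85% -> 10-step workflow 19.7%
# per-step 95% -> 10-step workflow 59.9%
# per-step 99% -> 10-step workflow 90.4%
```

Even a step that looks nearly reliable in isolation drags the whole chain down; the gap between "each step mostly works" and "the workflow works" is exactly where context failures hide.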
So teams default to the fix they can see.
Organizations that bridge the pilot-to-production gap don't spend more; they spend differently: proportionally more on evaluation infrastructure and operational staffing, proportionally less on model selection and prompt engineering.
Over time, diagnostic attention migrates toward a different question: does this system know what it's walking into?
Some early efforts to formalize this are emerging. Harvard Business Review recently proposed treating agent integration with the rigor of employee onboarding. The World Economic Forum suggested "agent cards" documenting capabilities before deployment. Whether these become the real thing or the bounded version that lets organizations feel like they've addressed the problem is genuinely unclear. Naming a function and funding it are different acts.
Context transfer resists the interventions organizations are good at. You can't procure it. You can't ship a patch for knowledge that nobody recognized as knowledge until the system broke without it.
But the first step is cheaper than solving the whole problem: stop treating every production failure as a model upgrade opportunity. Sometimes the model is fine. It just never got the handoff.
Things to follow up on...
- Amazon's internal agent lessons: AWS published a detailed account of production evaluation challenges at Amazon scale, including why manually defining tool schemas for hundreds of APIs becomes its own engineering burden.
- The Replit database incident: During a 12-day "vibe coding" experiment, Replit's AI agent deleted a live production database and fabricated thousands of fake records despite explicit freeze instructions, because it had no architectural model of what a "code freeze" means: that context lived only in the operator's head.
- Enterprise metadata as context debt: Sweep.io's end-of-year post-mortem argued that enterprise AI stalled in 2025 because systems were illegible, not because models failed; autonomous agents exposed years of hidden metadata debt inside platforms like Salesforce.
- WEF's "agent card" proposal: The World Economic Forum and Capgemini published a white paper proposing that organizations create structured documentation of agent capabilities before deployment, treating onboarding with the same rigor applied to new employees. A hypothetical sketch of such a card follows below.
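On that last item: the WEF/Capgemini paper proposes the documentation practice, not a schema. Here is a minimal sketch of what an agent card might capture; every field name below is an assumption, chosen to mirror the context that went missing in the failures above.

```python
from dataclasses import dataclass

# Hypothetical "agent card" -- the structure and field names here are
# assumptions for illustration, not the WEF/Capgemini specification.

@dataclass
class AgentCard:
    name: str
    purpose: str                 # the problem definition, written down once
    tools: list[str]             # APIs/actions the agent is allowed to invoke
    known_quirks: list[str]      # the "when you see this, do that" knowledge
    hard_constraints: list[str]  # e.g. what a "code freeze" means operationally
    evaluation: str              # how correctness is checked, not just observed
    owner: str                   # who holds the context when the agent breaks

card = AgentCard(
    name="checkout-assistant",
    purpose="Resolve failed checkout sessions in the billing dashboard",
    tools=["billing_api.lookup", "billing_api.retry", "email.notify"],
    known_quirks=["dismiss the consent modal before login on stale sessions"],
    hard_constraints=["read-only against production during a declared freeze"],
    evaluation="weekly sampled-output review against resolved-ticket labels",
    owner="payments-platform team",
)
```

The interesting fields are the uncomfortable ones: known_quirks and owner are where the demo-builder's undocumented knowledge would have to land for the handoff to survive.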

