Practitioner's Corner
When a production agent fails, the post-mortem always finds a fixable cause that isn't the agent itself. That pattern deserves a closer look.

The Agent's Alibi

Seventy-eight percent of enterprises are piloting AI agents. Fourteen percent have made it to production. That gap has held steady through billions of dollars spent improving data pipelines, prompts, tooling, and models. Every fix is genuine. The gap barely moves.
When an agent produces a wrong output, the post-mortem has five plausible places to land before it ever reaches the agent's own reasoning. All five deserve attention. But there is something structural in that fact, and it may explain why so much real progress does so little to close the distance.

One Production Failure, Thirty Scenarios, and the Debugging Infrastructure That Didn't Exist

A customer support agent submits a credit card replacement before the customer agrees to it. The session log looks clean. Every tool call completed, every response was generated on time, and the conversation reached a natural close. Standard monitoring sees a successful run. The agent made a bad judgment call inside a reasonable-looking transcript, and nothing in the observability stack can tell you where.
That gap between "something failed" and "here is specifically what failed" is wide enough that closing it required building an entire forensic environment from scratch. The infrastructure one team constructed to get there says a lot about how little already existed.
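The failure mode above is invisible to infrastructure-level monitoring because it lives in the transcript, not in latency or error codes. A minimal sketch of what a transcript-level check might look like: scan the session events for a guarded tool call that fires before any user turn that reads as consent. The event shapes, tool name, and confirmation phrases here are hypothetical illustrations, not any real observability API.

```python
# Hedged sketch: a post-hoc transcript check for the failure the article
# describes -- a sensitive tool call fired before the customer agreed.
# Event schema, tool name, and the confirmation phrase list are all
# assumptions for illustration.

CONFIRMATIONS = {"yes", "please do", "go ahead", "confirm"}

def premature_tool_calls(events, guarded_tool="replace_card"):
    """Return indices of guarded tool calls made before any user confirmation."""
    confirmed = False
    flagged = []
    for i, event in enumerate(events):
        if event["type"] == "user" and event["text"].strip().lower() in CONFIRMATIONS:
            confirmed = True
        elif event["type"] == "tool_call" and event["name"] == guarded_tool:
            if not confirmed:
                flagged.append(i)
    return flagged

session = [
    {"type": "user", "text": "I think my card was stolen"},
    {"type": "assistant", "text": "I can order a replacement. Shall I?"},
    {"type": "tool_call", "name": "replace_card"},  # fires before the answer
    {"type": "user", "text": "yes"},
]
print(premature_tool_calls(session))  # [2]
```

A check like this would report the exact event index where the judgment error occurred, which is precisely the "here is specifically what failed" answer that standard monitoring cannot give. The hard part in practice is that real consent rarely matches a phrase list, which is why the team in this story had to build a forensic environment rather than a one-off script.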


The Agent Incident Investigator Who Keeps a Second Set of Notes
The Systems Thinking Trap
Aviation spent fifty years writing "pilot error" on accident reports before recognizing the phrase explained nothing fixable. The corrective was systems thinking. Failures became organizational, distributed, contextual. Better, and also convenient: when every failure belongs to the system, no specific component bears obligation to change.
Agent systems are walking into the same room through the back door. Failures seed at step three and surface at step seventeen. "The system failed" becomes the default attribution not as a philosophical correction, but because isolating the causal thread is genuinely, architecturally hard. The diffusion of responsibility that aviation chose, agent teams inherit by default.
