Practitioner's Corner
When agent deployments fail, the post-mortem examines the model and the prompt. It rarely asks what the agent was never told.

Drop-Downs That Aren't Really Drop-Downs

Delaware's government portal goes offline at night. A pharmacy invoice site hides its download link behind a field that looks like a drop-down but behaves like a text box. A county records office redesigns its layout without warning. None of these have APIs. Humans navigate the mess without thinking. Scripts shatter against it. Somewhere between those two facts sits a bottleneck that more model capability alone won't close. One engineer spent years watching it consume his ML pipelines before deciding to build directly against it.

Why Agent Failures Keep Getting Blamed on the Model

In surgical teams, individual surgeons hold 83% of the information they need. Anesthesiologists hold 87%. Competent people, by any measure. Yet a study of perioperative handoffs found only 27% of critical knowledge was shared across the full team. The failure lived in the gaps between skilled people, not in the skill itself.
Agent deployments are breaking in a place that looks a lot like this. Ninety percent of enterprise pilots never reach production scale, and the postmortems keep landing on the same explanation: the model wasn't capable enough. That explanation keeps being wrong. And the cause lives somewhere most diagnostic tools don't reach.

Reasoning Once
A workflow costing $0.50 per LLM-in-the-loop execution becomes $50,000 a month at 100,000 runs. That math is why a quiet architectural pattern keeps surfacing: let the agent reason through a workflow once, compile that reasoning into deterministic Playwright code, then replay the script on every subsequent run. Think expensive thoughts exactly once.
Skyvern's implementation offers the clearest public numbers. Their explore-then-replay model cuts per-run costs from $0.11 to $0.04 and execution time from 279 seconds to 120. The runs become deterministic. What makes the pattern durable is intent metadata captured during exploration, so when a site inevitably changes, the system heals using semantic understanding rather than brittle CSS selectors.
The deeper idea here resembles institutional memory. A senior employee learns a process, writes it down, and the next person follows the notes. The LLM's initial exploration is the learning. The compiled code is the documentation. And like good institutional knowledge, it degrades gracefully: when the notes no longer match reality, the system knows enough about the original intent to adapt rather than fail silently.

