On WebArena, the best browser agents now complete around 62% of web tasks. On OSWorld, one system recently crossed the 72% human-performance threshold. Both benchmarks score in binary: the task either completes or it doesn't. A run that finishes 90% of the work scores identically to one that never starts.
This measurement choice has quietly become a design philosophy. When the only number that matters is full completion, there's no reason to engineer what the intermediate state looks like. No reason to preserve context about what's been tried, surface the point of failure, or structure remaining work so someone can pick it up cleanly. The partial run is scored, discarded, and forgotten.
Meanwhile, agents take 1.4 to 2.7 times as many steps as humans performing the same tasks. Nearly half of agentic workflows end before five steps with a human stepping in, and the research describes this as deliberate engineering, not model failure. Organizations are already designing for handoff at the workflow level. The handoff itself remains unexamined: what state the task is in, what context survives, what the human inherits at the transition point.
You can watch the consequences move through the industry. Investment favors full autonomy because partial automation with clean handoff state doesn't demo well. Evaluation frameworks test whether an agent can book a flight end-to-end. A failed booking that leaves the user with a confirmed seat selection and a clear next step scores the same as a crashed browser. One major provider's computer use API remains labeled "beta" with explicit guidance to restrict it to sandboxed environments. This is how a measurement choice becomes a design choice becomes a market structure.
A recent preprint on agent reliability borrows its core principle from aviation and nuclear engineering: graceful degradation beats rare-but-catastrophic failure. The analogy is worth sitting with. Cockpit handoffs, surgical handoffs, nursing shift changes. These domains learned, through body counts, that the transition moment is where failures concentrate. They made the handoff itself a designed artifact: structured state, explicit context, clear ownership of what comes next. Checklists reshaped cockpit design, because the instrument panel had to surface the information the next person would need.
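The three properties above (structured state, explicit context, clear ownership of what comes next) can be made concrete as a data structure. Everything here, field names included, is a hypothetical illustration of what a designed handoff artifact might record, not any vendor's actual schema:

```python
from dataclasses import dataclass, field


@dataclass
class HandoffState:
    """Hypothetical structured handoff record for a partial agent run.

    Each field corresponds to one property an aviation-style handoff
    would make explicit, rather than leaving for the human to reconstruct.
    """
    goal: str                         # what the task was trying to achieve
    completed_steps: list[str]        # structured state: what is already done
    attempted_and_failed: list[str]   # explicit context: what was tried
    failure_point: str                # where, and why, the run stopped
    next_owner: str                   # clear ownership of what comes next
    suggested_next_steps: list[str] = field(default_factory=list)

    def summary(self) -> str:
        """One-paragraph briefing for whoever inherits the task."""
        return (
            f"Goal: {self.goal}. "
            f"Completed {len(self.completed_steps)} steps; "
            f"stopped at: {self.failure_point}. "
            f"Next owner: {self.next_owner}. "
            f"Suggested: {'; '.join(self.suggested_next_steps) or 'none'}."
        )
```

The point of the sketch is that a failed booking which populates every field is a categorically different outcome from a crashed browser, even though a binary completion score treats them identically.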
Some researchers have started pulling in the same direction. AgentBoard, presented at NeurIPS 2024, decomposes tasks into subgoals and tracks the furthest point reached, even on failed runs. OSUniverse awards partial credit by subgoal graph. Both encode a specific assumption: that value accumulates incrementally, whether or not the task completes. If evaluation frameworks scored intermediate state quality, the systems optimizing against those frameworks would have to produce legible intermediate states. The measurement would reshape the design, the same way binary completion shaped it before.
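The assumption both frameworks encode can be sketched in a few lines: score the furthest subgoal reached instead of completion alone. The subgoal list and scoring rule below are illustrative, not AgentBoard's or OSUniverse's actual implementation:

```python
def progress_rate(subgoals: list[str], reached: set[str]) -> float:
    """Fraction of ordered subgoals completed before the first miss.

    Illustrative partial-credit metric: a run that finishes 90% of the
    work no longer ties with one that never starts.
    """
    done = 0
    for sg in subgoals:
        if sg not in reached:
            break
        done += 1
    return done / len(subgoals) if subgoals else 0.0


def binary_score(subgoals: list[str], reached: set[str]) -> float:
    """The status quo: all-or-nothing completion."""
    return 1.0 if all(sg in reached for sg in subgoals) else 0.0


# Hypothetical flight-booking task decomposed into ordered subgoals.
flight = ["search", "select_flight", "choose_seat", "enter_passenger", "pay"]
partial_run = {"search", "select_flight", "choose_seat", "enter_passenger"}

binary_score(flight, partial_run)   # 0.0: indistinguishable from a crash
progress_rate(flight, partial_run)  # 0.8: four of five subgoals reached
```

A system optimized against `progress_rate` has an incentive to reach, and leave legible, the furthest consistent state it can, which is exactly the design pressure the surrounding paragraph describes.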
What I genuinely don't know is what "good handoff state" looks like at the level of specificity that would make it measurable. Aviation had decades and catastrophic incentives to formalize it. The 37% performance gap between lab and production documented across agent deployments won't close by pushing completion rates from 62% to 75%. It narrows when the runs that don't complete leave the human somewhere useful. That's a design surface the industry hasn't yet learned to see, partly because every evaluation instrument is pointed the other direction.
Things to follow up on...
- Beyond binary task scoring: A December 2025 paper titled "Beyond Task Completion" proposes a four-pillar assessment framework for agentic systems, built after its authors found that agents completing tasks in a CloudOps production environment still deviated from expected policies in ways existing evaluation methods couldn't detect.
- Benchmark validity under scrutiny: A systematic review of popular agent benchmarks found severe validity issues in 8 out of 10, including the striking finding that do-nothing agents passed 38% of τ-bench airline tasks, raising questions about what completion scores actually measure.
- The production gap by definition: The 5x discrepancy between LangChain's self-reported 57% production deployment rate and Deloitte's externally-measured 11% suggests that what organizations count as "production" is doing as much definitional work as any technical barrier.
- Graceful degradation as design principle: The preprint "Towards a Science of AI Agent Reliability" catalogs real-world vendor failures, including an agent that made an unauthorized $31.43 Instacart purchase, and argues that systems which fail in known, expected ways are preferable to those that fail rarely but unpredictably.

