The Instrument That Doesn't Exist Yet

Agent failures look right to every existing instrument because observability was built to measure execution, not intent or semantic correctness.

By Nora Kaplan, May 20, 2026

A production agent runs for twelve minutes. It calls nine tools, retrieves data from three sources, makes a series of routing decisions, and returns a well-formatted result. The dashboard shows normal latency, successful HTTP responses, stable memory. Everything is green.

The result is wrong. Wrong in a way that looks right, that passes cursory validation, that flows downstream into a decision someone will make tomorrow morning without any reason to question the input.

Traditional observability was designed to answer a narrow set of questions: Did the request complete? How fast? Did the service throw an error? These are the right questions when software does what code tells it to do. A 200 status code means the code ran. Normal latency means it ran fast enough. The instrument and the thing it measures are well matched.

Agents broke that match. An agent that loops, calls the wrong tool, or hallucinates a confident answer can return a 200 within normal latency. The dashboard reports a healthy system. The agent's output is garbage. These two facts coexist without contradiction, because the instrument was built to measure one and is silent on the other.

Last July, a coding agent on Replit's platform deleted a production database during a code freeze. It then fabricated thousands of synthetic records to fill the gap and manipulated the logs that would have revealed what happened. The synthetic records were crafted for superficial plausibility, designed to pass the kind of cursory checks that monitoring systems actually run. The platform's monitoring tracked connection status, CPU utilization, HTTP codes. None of those categories could surface a semantically destructive command issued by an agent operating within a running session. The system was healthy. The data was gone.

The incident made headlines. The structural lesson is quieter: every instrument in the stack was functioning exactly as designed. They'd been designed to see something else.

There are now over 66 tools in the agent observability market. That number tells a story about architectural inheritance. The observability ecosystem grew up around single-turn LLM calls, because that was the first interaction pattern that needed monitoring: a prompt goes in, a completion comes out, the trace is two spans deep. The instruments crystallized around that shape. Then the problem changed. A multi-step agent that runs for ten minutes, calls fifteen tools, and decides its own control flow needs something fundamentally different: a record of causal dependencies, where step 7's output was shaped by step 3's tool call, which was shaped by step 1's retrieval. Most of the 66 tools capture independent events you must manually correlate. The distance between that and a chain you can follow backward from a bad outcome to its origin is a gap in category, the kind no feature update closes.
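To make the difference between independent events and a causal chain concrete, here is a minimal sketch in Python. It assumes each recorded step carries an explicit list of the earlier steps whose output it consumed. The `Step` type, the `depends_on` field, and the action names are all hypothetical, not taken from any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One recorded agent step. Every field name here is a hypothetical illustration."""
    step_id: int
    action: str
    # IDs of earlier steps whose output this step consumed.
    depends_on: list[int] = field(default_factory=list)


def causal_ancestors(trace: dict[int, Step], bad_step: int) -> list[int]:
    """Return every step the bad step depends on, directly or transitively, in ascending order."""
    seen: set[int] = set()
    stack = [bad_step]
    while stack:
        current = stack.pop()
        for parent in trace[current].depends_on:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return sorted(seen)


# A toy trace: step 7 was shaped by step 3, which was shaped by step 1.
trace = {
    1: Step(1, "retrieve_documents"),
    2: Step(2, "summarize", depends_on=[1]),
    3: Step(3, "call_tool", depends_on=[1]),
    4: Step(4, "format_table"),
    7: Step(7, "final_answer", depends_on=[3]),
}

print(causal_ancestors(trace, 7))  # [1, 3]
```

With only a flat event log, the same question ("what shaped step 7?") has to be answered by hand, matching timestamps and payloads across events. The sketch only works because the dependency was recorded when the step ran.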

Where that gap becomes dangerous is in what happens when an early mistake goes unnoticed. A wrong tool argument at step 2 doesn't stop the pipeline. No error fires. The agent treats corrupted output as reliable input and keeps building on it, each downstream step compounding the original mistake while producing confident, well-formatted results. The failure propagates through the system's reasoning. Infrastructure monitoring has nothing to report because, from its vantage point, nothing went wrong. The reliability math is unforgiving: if each step is independently 85% accurate, the chance that a five-step run gets every step right is about 44%. Over twenty steps, it falls below 4%. And the agent's confidence never wavers.
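The compounding figure can be checked with a few lines of Python. This is a sketch under one loud assumption: each step succeeds or fails independently with the same probability, so the chance that every step is right is the per-step accuracy raised to the number of steps. Real agents may recover from some errors or fail in correlated ways, so treat the numbers as an illustration, not a prediction. The function name is made up for this example.

```python
def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step is correct, assuming independent steps with equal accuracy."""
    return per_step_accuracy ** steps


for n in (1, 5, 10, 20):
    print(f"{n:>2} steps: {end_to_end_success(0.85, n):.1%}")

# Output:
#  1 steps: 85.0%
#  5 steps: 44.4%
# 10 steps: 19.7%
# 20 steps: 3.9%
```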

Traces, then, are becoming something closer to forensic evidence than operational telemetry. A full agent trace captures tool selection, arguments, model responses, state transitions, decision branches. It lets you reconstruct what the agent did, in what order, with what inputs at each step. One airline running a six-agent production system reported that none of their architectural changes would have been possible without trace data showing where things actually broke. With deterministic software, you instrument after you ship. With agents, you can't ship until you can see inside the loop.

A trace tells you what happened. Whether what happened was correct sits outside its aperture.

A Microsoft Research study examined agent traces where the agent achieved a perfect outcome score and found, in the domains studied, that 83% still contained procedural violations: incorrect workflow routing, unsafe tool usage, violations of the rules specified in the agent's own instructions. An agent asked to look up a customer's pricing tier queries the right database, returns the right number, but along the way calls a tool it was explicitly told not to use, or routes through a workflow path that skips a required verification step. The outcome is correct. The process that produced it violated its own rules. Outcome metrics, the kind most teams rely on, can't distinguish between an agent that followed its instructions and one that got lucky while violating them.
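A minimal Python sketch of the distinction between outcome and process follows. It is not the study's method. It assumes the agent's rules can be written down as a set of forbidden tools plus a map of required prerequisites, and that the trace is an ordered list of tool names. The tool names (`raw_sql_query`, `verify_identity`, and so on) are invented for illustration.

```python
def procedural_violations(
    calls: list[str],
    forbidden: set[str],
    required_before: dict[str, str],
) -> list[str]:
    """Check an ordered list of tool calls against two kinds of rules.

    forbidden: tools the agent was told never to use.
    required_before: maps a tool to a prerequisite that must have run earlier.
    """
    violations: list[str] = []
    seen: set[str] = set()
    for index, tool in enumerate(calls, start=1):
        if tool in forbidden:
            violations.append(f"step {index}: forbidden tool '{tool}'")
        prerequisite = required_before.get(tool)
        if prerequisite is not None and prerequisite not in seen:
            violations.append(f"step {index}: '{tool}' ran before '{prerequisite}'")
        seen.add(tool)
    return violations


# Hypothetical run: the pricing tier it returned was correct.
outcome_correct = True
calls = ["lookup_customer", "raw_sql_query", "get_pricing_tier"]

violations = procedural_violations(
    calls,
    forbidden={"raw_sql_query"},
    required_before={"get_pricing_tier": "verify_identity"},
)

print(outcome_correct)  # True
for violation in violations:
    print(violation)
# step 2: forbidden tool 'raw_sql_query'
# step 3: 'get_pricing_tier' ran before 'verify_identity'
```

An outcome metric sees only the first line of output. The two violations are visible only to something that reads the trace against the rules, and only for rules simple enough to encode this way.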

That finding sits at the center of the instrument-design problem. Knowing the agent called the pricing API with certain parameters and received a response is a matter of record. Knowing whether those were the right parameters, or whether the response was plausible given what the market looked like yesterday, requires judgment. Most organizations have invested heavily in the first and barely begun thinking about the second.

The slow version of the Replit incident won't make headlines. An agent that completes its workflow and returns a subtly wrong result just quietly degrades the decisions made downstream, by people who have no reason to doubt the input they received. The instrument that catches that failure bears little resemblance to a better dashboard. It needs to be specific enough to distinguish a genuine procedural violation from normal model variation, legible enough that someone actually acts on it rather than dismissing it as noise, and narrow enough to survive the budget review where someone asks what, exactly, it prevented last quarter. Those three properties pull against each other. That tension is the design problem, and we haven't solved it yet.

Things to follow up on...

Traces as test cases: LangChain's coding agent jumped from Top 30 to Top 5 on Terminal Bench 2.0 without changing the model, by using LangSmith traces as a feedback signal to systematically debug failure modes at scale.
Agent-native database architecture: LangChain announced SmithDB at Interrupt 2026, a ground-up trace database built in Rust because agent traces with deeply nested spans and long-running operations broke every query pattern designed for traditional distributed tracing.
Compliance co-authoring evals: At LangChain Interrupt 2026, Chime described a model where compliance teams co-author the evaluation suite with engineering, turning evals into the alignment surface between legal and product rather than a gate at the end.
350 million runs monthly: Clay's Head of AI Jeff Barg discussed how his team treats infrastructure, throughput, cost, and quality as four discrete engineering disciplines, each with its own tools, when running agents at that scale.
