A customer support agent handling credit card replacements goes sideways. Sometimes it submits a replacement before the customer agrees. Sometimes it stalls, staying vague, never clarifying what the customer wants. Both behaviors look reasonable in a log file. Neither is correct. And standard observability wouldn't flag either one. The agent ran. The tools were called. The session completed.
The distance between "something failed" and "here is specifically what happened" is enormous. Veris AI, a New York-based startup founded by Mehdi Jamei and Andi Partovi, has spent the past year building infrastructure to close it. Their pipeline, documented on Microsoft Community Hub through a joint build with enterprise security firm Lume Security, reconstructs what happened in a failed agent session, why it happened, and what to do about it. The complexity of that reconstruction is itself a measure of how primitive agent debugging currently is.
Following the chain
When something goes wrong, teams have session logs. Timestamps, tool calls, model outputs. The equivalent of a flight data recorder, except nobody has built the accident investigation board. Most debugging today means a developer manually stepping through traces, freezing randomness by lowering temperature to zero, testing hypotheses about which layer failed. That works for one incident. It doesn't scale.
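That manual loop can at least be put behind a harness. A minimal sketch, assuming a hypothetical logged-step schema and a `model_fn` wrapper around whatever chat client the team already uses; none of this is Veris's format:

```python
# Hypothetical session-log schema and replay helper; illustrative only.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class LoggedStep:
    timestamp: str
    prompt: str   # the full context the agent saw at this step
    output: str   # what the agent actually produced (reply or tool call)

def replay_step(
    step: LoggedStep,
    model_fn: Callable[..., str],          # e.g. a thin wrapper around a chat API
    patched_prompt: Optional[str] = None,
) -> str:
    """Re-run one recorded step with randomness frozen, optionally with an
    edited prompt, to test a hypothesis about which layer caused the failure."""
    prompt = patched_prompt if patched_prompt is not None else step.prompt
    return model_fn(prompt, temperature=0.0)  # freeze sampling for repeatable runs

# Typical use: step through the trace, re-run the suspicious step as logged,
# then re-run it with the tool error removed from the prompt and compare.
```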
So Veris built an LLM-based evaluator that reads the full trajectory of an agent session and identifies where behavior diverged from intent: not one turn in isolation, but the whole decision path, the full arc of the conversation. The agent made a bad judgment call somewhere in there, and finding that call means understanding context across every exchange.
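A trajectory-level judge in that spirit might look like the sketch below. The prompt wording, the JSON shape, and the `judge` callable (any function that sends a prompt to an LLM and returns its text) are assumptions for illustration, not Veris's implementation.

```python
# A whole-session evaluator: the judge sees every turn, not isolated snippets,
# and is asked to locate where behavior diverged from the customer's intent.
import json
from typing import Callable

JUDGE_PROMPT = """You are reviewing a complete customer-support agent session.
Identify the first turn where the agent's behavior diverged from what the
customer actually wanted (for example, acting without consent, or stalling
instead of asking a clarifying question). Respond as JSON:
{{"diverged_at_turn": <int or null>, "reason": "<one sentence>"}}

Transcript:
{transcript}
"""

def evaluate_trajectory(turns: list[dict], judge: Callable[[str], str]) -> dict:
    """turns: [{"role": "user"|"agent"|"tool", "content": "..."}, ...]"""
    transcript = "\n".join(
        f"[turn {i}, {t['role']}] {t['content']}" for i, t in enumerate(turns)
    )
    raw = judge(JUDGE_PROMPT.format(transcript=transcript))
    return json.loads(raw)  # e.g. {"diverged_at_turn": 7, "reason": "..."}
```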
Once you've identified the failure point, a second problem surfaces. Standard evaluation rubrics, where an LLM judge rates performance on a 1-to-5 scale, are wildly inconsistent. Veris's own research found that broad checks like "were errors handled gracefully?" achieved only 43% consistency between evaluations. Narrow, failure-specific checks like "did the agent ignore a tool error?" hit 98%. So the platform automatically generates targeted rubrics from each new failure class. The evaluation instrument gets built from the failure itself.
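A sketch of that step, under the same assumed `judge` callable; the prompt text and the idea of majority-voting each binary check for stability are illustrative choices rather than Veris's published method.

```python
# Turn a diagnosed failure into narrow yes/no checks, then apply each check.
from collections import Counter
from typing import Callable

def generate_rubric(failure_description: str, judge: Callable[[str], str]) -> list[str]:
    """Ask an LLM to write failure-specific yes/no checks, one per line."""
    prompt = (
        "An agent failed in the following way:\n"
        f"{failure_description}\n\n"
        "Write 3-5 narrow yes/no checks that would catch exactly this failure, "
        "in the style of 'Did the agent ignore a tool error?'. One per line."
    )
    return [line.strip("- ").strip() for line in judge(prompt).splitlines() if line.strip()]

def apply_check(transcript: str, check: str, judge: Callable[[str], str], votes: int = 3) -> bool:
    """Run one binary check several times and take the majority answer."""
    answers = [
        judge(f"Transcript:\n{transcript}\n\nQuestion: {check}\nAnswer strictly YES or NO.")
            .strip().upper().startswith("YES")
        for _ in range(votes)
    ]
    return Counter(answers).most_common(1)[0][0]
```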
And one incident is one data point. Optimizing on a single example means overfitting to that exact scenario. Veris's scenario engine takes the production log and expands it into ~30 variants, adding branching and variety to explore the edges of that failure mode. These run against simulated tools and users, get evaluated against the targeted rubrics, and the results show whether the fix actually worked or just shifted the failure somewhere else. New scenarios always run alongside the full existing library, so fixing one failure class doesn't quietly degrade performance elsewhere.
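Sketched with hypothetical names, the expand-and-regress loop might look like this; `run_agent` stands in for the agent running against simulated tools and users, and `score` applies the targeted rubric checks to the resulting transcript.

```python
# Expand one incident into many variants, run them, and keep the old library
# in every run so a fix can't silently regress an earlier failure class.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Scenario:
    seed_log: str                  # the production session this variant came from
    variation: str                 # e.g. "customer changes their mind mid-conversation"
    rubric: list[str] = field(default_factory=list)

def expand(seed_log: str, variations: list[str], rubric: list[str]) -> list[Scenario]:
    """Turn one incident into branching variants of the same failure mode."""
    return [Scenario(seed_log, v, rubric) for v in variations]

def run_suite(
    scenarios: list[Scenario],
    run_agent: Callable[[Scenario], str],      # agent vs. simulated tools and users
    score: Callable[[str, list[str]], bool],   # rubric checks on the transcript
) -> dict[str, bool]:
    return {s.variation: score(run_agent(s), s.rubric) for s in scenarios}

# new_scenarios = expand(incident_log, variations, targeted_rubric)
# results = run_suite(existing_library + new_scenarios, run_agent, score)
# all_pass = all(results.values())
```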
The pipeline as evidence
Each stage exists because the previous one wasn't sufficient: session logs → semantic evaluation → targeted rubrics → scenario expansion → regression protection.
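Composed end to end, the chain reduces to a few calls. The sketch below reuses the illustrative functions from earlier in this piece, imported from a hypothetical module; none of these names are Veris's actual API.

```python
# Wiring the stages together: semantic evaluation -> targeted rubric ->
# scenario expansion -> regression run against the full library.
from forensic_pipeline import (  # hypothetical module collecting the sketches above
    evaluate_trajectory, generate_rubric, expand, run_suite,
)

def debug_incident(session, existing_library, judge, run_agent, score, variations):
    diagnosis = evaluate_trajectory(session["turns"], judge)        # where it went wrong
    rubric = generate_rubric(diagnosis["reason"], judge)            # checks built from the failure
    new_scenarios = expand(session["raw_log"], variations, rubric)  # variants of the incident
    # New scenarios always run alongside the existing library.
    return run_suite(existing_library + new_scenarios, run_agent, score)
```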
That chain is the point. Jamei has described the core problem as companies jumping straight from model-building to production without intermediate environments. The forensic infrastructure his team built is one measure of how wide that gap actually is. Every layer in the pipeline exists because the question "what went wrong?" kept requiring another layer of machinery to answer, until what emerged was an entire environment where failures could be reproduced, varied, measured, and learned from. The environment exists now, for the teams that recognize they need it. How many teams even know they're flying without an accident investigation board is a question the pipeline can't answer for them.
Things to follow up on...
- The pilot-production gap widens: A March 2026 survey of 650 enterprise leaders found 78% have agent pilots running but only 14% have reached production scale, with the gap widening as agents grow more capable.
- Agent error taxonomies emerge: A research team proposed AgentDebug, a modular classification of failure modes spanning memory, planning, action, and system-level operations, suggesting the field is beginning to formalize what "going wrong" actually means.
- Evaluation as the critical gap: The newly formed Agentic AI Foundation, co-founded by Anthropic, Block, and OpenAI, has identified evaluation infrastructure as the most urgent missing layer, with the best models scoring under 23% on realistic benchmarks.
- Cost of undetected failures: Analysis of multi-agent production systems shows accuracy dropping from 95-98% in pilots to 80-87% under real-world pressure, with the most dangerous failure mode being outputs that look correct, so nothing alerts anyone until real damage surfaces.

