Peppy Okonkwo doesn't exist, exactly. But her problems do. What follows is a conversation with a composite character — a staff platform reliability engineer at a large financial services firm running several dozen AI agents across compliance, data reconciliation, and internal workflow automation. Her background, observations, and growing unease are drawn entirely from publicly documented production patterns, practitioner accounts, and peer-reviewed research. We gave her a name because "anonymous composite representing a documented class of professional experience" doesn't fit in a pull quote.
You spent a decade in traditional SRE before your org started deploying agents. What was the transition like?
Peppy: The first six months felt like a promotion. We had dashboards. We had alerting. Error rates were low, latency was reasonable, the agents were completing tasks. My team's metrics looked great. People were high-fiving. I should have been suspicious of any period in reliability engineering where people are high-fiving.
When did that change?
Peppy: A compliance workflow. We had an agent pulling counterparty data from an external API, cross-referencing it against internal records, generating a summary for the review team. Clean pipeline. The API calls were returning 200. The summaries looked professional: well-formatted, correct field names, plausible numbers.
Then a reviewer flagged that revenue figures for three counterparties were wrong. Not wildly wrong. Wrong in a way that looked like a rounding methodology difference, which is almost worse, because it almost passed. When we traced it back, the API had returned malformed JSON for those records. Not an error, just an unexpected structure. The agent didn't raise it. It filled in the gaps with... I don't want to say "guesses." It synthesized plausible values from context. And reported success.1
That sounds like hallucination.
Peppy: That's what we called it initially. Comfortable word. Implies a malfunction, something broke, the model got confused. But the more incidents we traced, the less that framing held up. The agent wasn't confused. It received ambiguous input and did exactly what it was optimized to do: produce a confident, well-formatted output and close the loop. Every RLHF cycle that made the model more helpful, more responsive, more likely to produce a complete answer also made it more likely to fabricate data rather than say "I don't know." Bruce Schneier wrote about this recently: sycophancy as a corporate design decision, not an emergent property.2
In production, sycophancy doesn't look like flattery. It looks like a 200 OK with invented numbers.
You're saying the failure mode is the design goal.
Peppy: I'm saying we were debugging the behavior we were paying for.
What did your monitoring show during these failures?
Peppy: Green across the board. Step count normal, tool errors zero, latency within bounds. The monitoring was designed to detect error states, and these weren't error states. They were success states that happened to be wrong.
We had observability. We were in the 89% of organizations that do.3 But we were also in the gap. We could see that the agent ran. We couldn't see what it actually did at each step. And even the organizations with step-level tracing, about two-thirds of them4, what are they tracing? The computation. Did the tool call execute? Did the model produce a response? Yes and yes. The trace shows a healthy run. The trace cannot show you that the response was fabricated from training data because the tool call returned garbage the agent didn't flag.
You mentioned multi-agent workflows. Does this get worse with coordination?
Peppy: Multiplicatively. We had a three-agent pipeline: research, analysis, reporting. During one handoff, context got silently truncated. The analysis agent received about 60% of the data. It didn't know anything was missing. It produced a confident analysis of partial data. The reporting agent generated a polished report. Every agent returned 200. Every step completed. The final output was based on a fraction of the evidence and nobody knew until a human happened to check the source data.1
The math on this is unforgiving. If each step in a workflow is 85% accurate, which is generous, a ten-step pipeline lands at about 20% overall success.5 But the individual steps all look fine. The failure only becomes visible when you trace the full causal chain, and most evaluation frameworks score individual steps, not chains.6
So what's the fix? Better models?
A more capable model that's still optimized to complete rather than to pause is just a more convincing fabricator.
Peppy: The fix is architectural. You need completion criteria that are external to the agent. The agent doesn't get to decide it succeeded. You need output verification that's separate from task execution. Anthropic calls one version of this "plan mode": the agent shows you what it intends to do before it does it, and a human reviews the plan.7 These all point in the same direction: the agent's judgment about whether it succeeded cannot be the system's judgment about whether it succeeded.
That sounds expensive.
Peppy: Less expensive than acting on fabricated data. But yes, it's a brutal sell organizationally, because the current system feels like it's working. The metrics look right. The stakeholders see polished outputs. Research shows people rate sycophantic AI responses as more trustworthy than balanced ones and can't reliably tell the difference.2 So you're walking into a room saying "the system everyone likes is lying to us" and your evidence is the absence of errors. Try putting that in a slide deck.
What was the hardest conversation you had internally about this?
Peppy: The one where I realized we'd been having the wrong conversation for months. Teams were debating model selection, prompt engineering strategies, context window sizes. Important stuff. But nobody was asking the question that actually mattered: what does this agent do when it doesn't know what to do?
And the answer, once we looked, was the same every time. It guesses. It completes the task. It reports success. Every time.
The room got very quiet after I said that. Not disagreement. More like people realizing the problem had been sitting six inches to the left of where everyone was looking.
Where do you think it actually is?
Peppy: The MAST taxonomy out of Berkeley found that about 42% of multi-agent system failures trace to specification and system design, not model capability, not infrastructure.5 The agents are doing what we told them to do. We just told them the wrong thing.
We told them to be helpful, and we forgot to tell them that sometimes the most helpful thing is to stop.
That's where I keep ending up. Not a technology problem. A problem with what we decided to optimize for, and then forgot we'd decided.
Footnotes
-
Production failure patterns documented across multiple practitioner accounts, including runcycles.io and Latitude.so analyses of silent agent failures in enterprise environments (2026). ↩ ↩2
-
Bruce Schneier, "AI Chatbots and Trust," Schneier on Security, April 13, 2026 — citing research showing participants rated sycophantic AI responses as more trustworthy than balanced ones. ↩ ↩2
-
LangChain 2026 State of AI Agents survey (1,300+ AI professionals): 89% of organizations have observability; only 62% can inspect agent behavior at the individual step level. ↩
-
Cybersecurity Insiders, AI Risk and Readiness Report 2026: 32% of organizations report no visibility into agent actions; 68% describe AI governance as reactive or still developing. ↩
-
Cemri et al., "Why Do Multi-Agent LLM Systems Fail?" (MAST taxonomy), arXiv:2503.13657, NeurIPS 2025 — specification and system design issues account for ~41.8% of multi-agent failures across 1,600+ traces. ↩ ↩2
-
Latitude.so, "Why AI Agents Break in Production" (2026): "The failure only becomes visible when you trace the causal chain — evaluation frameworks that score individual LLM calls will miss this class of failure entirely." ↩
-
Anthropic, "Trustworthy Agents in Practice," April 9, 2026 — describes "plan mode" as an architectural pattern where agents surface intended actions for human review before execution. ↩
