Dashboards report 98% success rates. Logs capture every decision. Traces follow execution paths across distributed systems. Customers report broken workflows. The disconnect sits in plain view.
Atindriyo Sanyal built Galileo's observability platform after years of bringing thousands of AI models into production at Uber and working on early versions of Siri at Apple. When he co-founded Galileo in 2021, the team assumed the problem was visibility. Give teams the right metrics, they figured, and reliability would follow.
Production had other lessons to teach.
Debugging Assumptions Break Down
Traditional debugging assumes you can reproduce failures. Set a breakpoint, step through the code, trace the path that led to the error. Agent systems produce different outputs from identical inputs, and that is fundamental behavior, not an edge case. Breakpoints don't help when systems reason non-deterministically. Unit tests can't assert on behavior that isn't reproducible. Linear logs presume a single path through the code, and agents don't follow one.
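To make the testing problem concrete, here's a minimal sketch. The `call_agent` function below is a hypothetical stand-in for any LLM-backed step sampled at nonzero temperature, not a real API; the failure mode it illustrates is the one that breaks unit tests: an exact-match assertion that passes on one run and fails on the next.

```python
import random

def call_agent(prompt: str) -> str:
    """Hypothetical stand-in for an LLM-backed agent step.

    Real systems would call a model API here; the only point is that
    identical inputs can legitimately produce different outputs.
    """
    phrasings = [
        "Q3 revenue grew 12% year over year.",
        "Revenue rose 12% YoY in Q3.",
        "Third-quarter revenue was up twelve percent.",
    ]
    return random.choice(phrasings)

def test_summary_is_stable():
    # The classic unit-test assumption: same input, same output.
    first = call_agent("Summarize the Q3 report.")
    second = call_agent("Summarize the Q3 report.")
    assert first == second  # passes on some runs, fails on others

if __name__ == "__main__":
    test_summary_is_stable()
```

The usual workaround is asserting on properties of the output rather than exact strings, which is already a concession that the deterministic debugging toolkit doesn't transfer.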
Web automation surfaces a similar dynamic, though the mechanism differs. An authentication failure early in a session corrupts state three steps later. A caching decision made during one request contaminates subsequent ones. You're debugging what looks like a data quality issue, but the root cause sits several interactions back. The observability platform captured everything. Finding how the contamination propagated is another matter.
Sanyal's team documented what they call "memory poisoning" in agent systems. An agent hallucinates data early in a workflow. That contaminated information passes to downstream agents, who incorporate it into their analyses. Three interactions later, accuracy degrades gradually without triggering immediate failures. Teams spend hours debugging subtle data quality issues without realizing the root cause started several steps back with a single agent's mistake.
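A minimal sketch of how that propagation looks in code. The shared-context structure and agent names here are hypothetical, not Galileo's schema or any particular framework; the point is that nothing raises an error when the poisoned value is written, and every downstream step treats it as ground truth.

```python
# Hypothetical three-step workflow sharing a mutable context dict.
context = {}

def research_agent(ctx: dict) -> None:
    # Hallucinated figure: the agent asserts a market size it never retrieved.
    ctx["market_size_usd"] = 4.2e9  # no exception, no warning

def analysis_agent(ctx: dict) -> None:
    # Downstream agent incorporates the upstream value as if it were verified.
    ctx["projected_share_usd"] = 0.03 * ctx["market_size_usd"]

def report_agent(ctx: dict) -> str:
    # Three steps later the output is quietly wrong; nothing ever "failed."
    return f"Projected revenue: ${ctx['projected_share_usd']:,.0f}"

research_agent(context)
analysis_agent(context)
print(report_agent(context))  # Projected revenue: $126,000,000
```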
Full visibility into what happened leaves you still needing to understand why the agent chose that particular path, or what would have happened if it had chosen differently.
Production Scale
At Uber, Sanyal helped build infrastructure serving over a billion users. That experience taught him how to think about systems at scale. Agent systems introduced failure modes he hadn't encountered before. Multi-agent systems show failure rates ranging from 41% to 86.7% in production, driven by coordination breakdowns that emerge from how agents interact, not by coding errors.
Those numbers mean something specific in production. At a 41% failure rate with 10,000 daily workflows, that's 4,100 broken executions. Each one potentially misses a competitive move, delivers wrong data to a decision-maker, or exposes information that shouldn't be visible. Failures distribute across thousands of interactions, which makes them harder to see and more expensive to fix.
Two agents enter a communication loop, exchanging redundant information due to a coordination protocol bug. Hours pass before anyone notices.
You're reviewing your infrastructure costs and see an unexpected spike. $3,200 in token consumption over a weekend when traffic should have been minimal. You trace it back through logs. The two agents started their loop Friday afternoon. By Monday morning, they'd exchanged 47,000 messages, none of which advanced any actual work.
Traditional monitoring flagged nothing because HTTP errors never occurred. The system was working exactly as designed. The design was wrong.
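One way to catch that class of failure is to watch conversational progress instead of transport-level health. The sketch below is a hypothetical guardrail, not Galileo's or any framework's built-in feature: it flags an agent pair once their exchange count climbs while the content stops changing.

```python
from collections import defaultdict

class LoopGuard:
    """Hypothetical guardrail that flags agent pairs exchanging messages
    without making progress. Every underlying HTTP call can still be a 200."""

    def __init__(self, max_exchanges: int = 200, duplicate_ratio: float = 0.8):
        self.max_exchanges = max_exchanges
        self.duplicate_ratio = duplicate_ratio
        self.counts = defaultdict(int)
        self.seen = defaultdict(set)
        self.duplicates = defaultdict(int)

    def record(self, sender: str, receiver: str, message: str) -> bool:
        """Record one message; return True if the pair looks like a runaway loop."""
        pair = (sender, receiver)
        self.counts[pair] += 1
        if message in self.seen[pair]:
            self.duplicates[pair] += 1
        self.seen[pair].add(message)

        over_budget = self.counts[pair] > self.max_exchanges
        mostly_redundant = (
            self.counts[pair] > 20
            and self.duplicates[pair] / self.counts[pair] > self.duplicate_ratio
        )
        return over_budget or mostly_redundant

guard = LoopGuard()
for _ in range(300):
    if guard.record("planner", "retriever", "status? no update"):
        print("runaway loop detected")  # fire an alert or kill the workflow
        break
```

Flagging on Friday afternoon instead of Monday morning is the difference between a few hundred wasted messages and 47,000.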
An orchestrator delegates a financial calculation with an ambiguous success criterion. The specialist completes its task within technical parameters but misinterprets the business constraint. Three downstream agents incorporate this flawed output. Errors compound through the workflow. By the time it surfaces as a customer-visible failure, you're debugging five separate log streams trying to reconstruct what happened.
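Reconstructing that kind of failure usually comes down to stitching the separate streams together on a shared identifier. A minimal sketch, under the assumption that each agent already emits structured records carrying a `workflow_id` field (an assumption, not a given in most stacks):

```python
from itertools import chain

# Hypothetical structured records from separate per-agent log streams.
orchestrator_logs = [
    {"workflow_id": "wf-91", "ts": 1, "agent": "orchestrator",
     "event": "delegated calculation; success criterion left ambiguous"},
]
specialist_logs = [
    {"workflow_id": "wf-91", "ts": 2, "agent": "finance-specialist",
     "event": "returned figure; interpreted constraint differently"},
]
downstream_logs = [
    {"workflow_id": "wf-91", "ts": 3, "agent": "forecaster", "event": "consumed figure"},
    {"workflow_id": "wf-91", "ts": 4, "agent": "report-writer", "event": "customer-visible output"},
]

def timeline(workflow_id: str, *streams):
    """Merge per-agent log streams into one time-ordered view of a workflow."""
    records = [r for r in chain(*streams) if r["workflow_id"] == workflow_id]
    return sorted(records, key=lambda r: r["ts"])

for record in timeline("wf-91", orchestrator_logs, specialist_logs, downstream_logs):
    print(record["ts"], record["agent"], "-", record["event"])
```

The stitching itself is trivial; the hard part is that the ambiguous success criterion never appears as an error in any single stream, only as a discrepancy visible across them.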
Traditional observability captures model-level metrics but misses what matters:
- Distributed agent network tracing
- Emergent behavior detection
- Inter-agent communication bottleneck analysis
Complexity emerges from how agents interact.
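What a trace shaped for that kind of analysis might carry, sketched as a hypothetical span record (field names are illustrative, not any vendor's schema): alongside the usual timing data, it records which agent called which, what it asked for, and how much of the latency was coordination rather than work.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AgentSpan:
    """Hypothetical trace span for one inter-agent interaction."""
    workflow_id: str
    span_id: str
    parent_span_id: Optional[str]
    caller_agent: str
    callee_agent: str
    intent: str               # what the caller asked for, in its own words
    coordination_ms: float    # time spent negotiating / handing off
    execution_ms: float       # time the callee spent on the actual work
    output_digest: str        # hash or summary of what was handed back
    children: List["AgentSpan"] = field(default_factory=list)

root = AgentSpan("wf-204", "s1", None, "orchestrator", "researcher",
                 "find competitor pricing", coordination_ms=180.0,
                 execution_ms=2400.0, output_digest="sha256:ab12")
root.children.append(
    AgentSpan("wf-204", "s2", "s1", "researcher", "summarizer",
              "condense findings", coordination_ms=95.0,
              execution_ms=800.0, output_digest="sha256:cd34")
)
```

With coordination time recorded per hop, the bottleneck analysis the bullets describe becomes an aggregation over spans rather than a forensic exercise.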
Debugging Non-Deterministic Systems
When systems reason non-deterministically, debugging requires systems that can reason about non-deterministic behavior.
In July 2025, Galileo launched what it calls an Insights Engine. The system ingests logs and metrics, then uses specialized evaluation models to identify failure modes and surface actionable recommendations tied to specific traces.
The architectural choice acknowledges a production reality: you can't debug non-deterministic systems with deterministic tools. At thousands of execution traces, patterns emerge that humans can't spot manually. The Insights Engine reasons about why agents made specific choices, recognizes patterns across thousands of executions, and surfaces the insights that matter for the next deployment.
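Stripped of the model-assisted parts, the underlying pattern is cluster-and-rank over failure signatures extracted from traces. The sketch below is a generic illustration of that pattern under assumed signature names, not the Insights Engine itself:

```python
from collections import Counter

# Hypothetical failure signatures extracted from execution traces.
trace_signatures = [
    ("tool_timeout", "retriever"),
    ("hallucinated_field", "research-agent"),
    ("tool_timeout", "retriever"),
    ("loop_no_progress", "planner->retriever"),
    ("tool_timeout", "retriever"),
    ("hallucinated_field", "research-agent"),
]

def surface_insights(signatures, top_n: int = 3):
    """Rank recurring failure modes so the next debugging session starts
    from the most common pattern, not from a random broken trace."""
    ranked = Counter(signatures).most_common(top_n)
    return [
        {"failure_mode": mode, "agent": agent, "trace_count": count}
        for (mode, agent), count in ranked
    ]

for insight in surface_insights(trace_signatures):
    print(insight)
```

At six traces a frequency count is a toy; at thousands, the same ranking is what turns raw visibility into a place to start.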
The gap between seeing what happened and understanding why it happened is where production reliability actually lives. Sanyal's work at Galileo exposes this gap at scale. Observability provides visibility into system behavior. Debuggability demands systems that can reason about the non-deterministic behavior they're trying to diagnose.
Things to follow up on...
- Coordination latency compounds: As multi-agent systems scale, coordination latency grows from approximately 200ms with two agents to over 4 seconds with eight or more agents, creating bottlenecks that traditional monitoring doesn't surface.
- Real-time guardrails at scale: Galileo's Luna-2 models enable running 10-20 sophisticated metrics simultaneously with sub-200ms latency at 100% sampling rates, addressing the challenge of protecting against failures without slowing production workflows.
- The illusion of monitoring: Sanyal discussed how generic metrics create false confidence that doesn't account for real production failures, requiring domain-specific evaluations tailored to specific applications.
- Hidden state complexity: Agent systems maintain internal variables, conversation history fragments, and reasoning steps that sit outside logs yet shape every decision, creating "memory drift" that's invisible to traditional observability.

