When an agent fails at step 47 of a 50-step workflow, maybe verifying a hotel booking across a regional site that just changed its confirmation page, you don't get a stack trace. You get silence, or worse: a "success" message with subtly corrupted data that won't surface until a customer complains three days later.
Alex Reibman hit this wall last summer at a San Francisco hackathon, where he built AI agents to scrape the web.
"They sucked," he said later. Not because the concept was flawed, but because when they failed, and they failed often, he had no way to understand why.
No logs that made sense. No way to replay what happened. No visibility into which decision went wrong or why the agent thought it succeeded when it clearly hadn't.
That frustration led him to start AgentOps (now Agency AI) with co-founders Adam Silverman and Shawn Qiu. The problem they're tackling: the tools we have for understanding software weren't designed for systems that make probabilistic decisions across hundreds of coordinated steps.
Traditional debugging assumes you can reproduce the error, trace the execution path, identify the faulty line. Agents don't work that way. They make decisions based on context that shifts, call tools that might return different results, coordinate with other agents whose state you can't see. The "code" isn't just code. It's a dialogue involving external tool calls, memory lookups, and reasoning steps that exist only as token probabilities. Root causes hide several steps upstream from where failures become visible.
Reibman recognized that agents need different observability infrastructure. AgentOps tracks what traditional monitoring can't capture: LLM calls, costs, latency, agent failures, multi-agent interactions, tool usage. All timestamped and traceable. More importantly, it provides the ability to rewind and replay agent runs with point-in-time precision, jumping to failure points and re-running from checkpoints without executing the full flow again.
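To make that concrete, here is a minimal sketch of the kind of trace this class of tooling maintains: every step timestamped with its inputs, result or error, and latency, plus a hook for replaying from a chosen checkpoint. The `AgentTracer` class and its method names are hypothetical illustrations, not AgentOps' actual SDK.

```python
# Hypothetical sketch, not AgentOps' SDK: a timestamped step trace with checkpoint replay.
import json
import time
from dataclasses import dataclass, field, asdict


@dataclass
class StepRecord:
    """One agent step: what was called, with what inputs, and what came back."""
    index: int
    kind: str                     # "llm_call", "tool_call", "memory_lookup", ...
    name: str
    inputs: dict
    output: object = None
    error: str | None = None
    latency_ms: float = 0.0
    timestamp: float = field(default_factory=time.time)


class AgentTracer:
    """Records every step so a failed run can be inspected and replayed from a checkpoint."""

    def __init__(self) -> None:
        self.steps: list[StepRecord] = []

    def record(self, kind: str, name: str, inputs: dict, fn):
        """Run `fn(**inputs)`, capturing output or error, latency, and timestamp."""
        step = StepRecord(index=len(self.steps), kind=kind, name=name, inputs=inputs)
        start = time.time()
        try:
            step.output = fn(**inputs)
            return step.output
        except Exception as exc:          # keep the failure in the trace instead of losing it
            step.error = repr(exc)
            raise
        finally:
            step.latency_ms = (time.time() - start) * 1000
            self.steps.append(step)

    def dump(self, path: str) -> None:
        """Write the full trace to disk as JSON."""
        with open(path, "w") as f:
            json.dump([asdict(s) for s in self.steps], f, indent=2, default=str)

    def replay_from(self, checkpoint: int, registry: dict) -> None:
        """Re-run steps from `checkpoint` onward using their recorded inputs."""
        for step in self.steps[checkpoint:]:
            registry[step.name](**step.inputs)


# Usage: tracer.record("tool_call", "fetch_page", {"url": url}, fetch_page)
```

The schema matters less than the property it buys you: when step 47 fails, the preceding 46 steps are already on disk, and you can jump straight to the failure point instead of re-running the whole flow.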
Co-founder Adam Silverman frames the platform as:
"Mobile Device Management for AI agents—tracking every single digital movement."
Just as IT teams need to know which employees accessed what systems when, agent infrastructure needs similar visibility. Not for surveillance, but for accountability and debugging. When an agent makes a purchasing decision or extracts competitive data, you need to reconstruct exactly what it "saw" and why it acted.
"You want to understand whether your agent is going to go rogue and identify what limitations you can put in place. A lot of the work is being able to visually see where your guardrails exist, and whether the agent abides by them, before tossing them into production."
We've hit this exact wall at TinyFish, where we orchestrate thousands of browser sessions across fragmented web surfaces. An agent successfully navigates 47 authentication steps, handling cookies, session tokens, and regional redirects, then suddenly fails on step 48 because a site's CAPTCHA behavior changed in one geography. Traditional logs show "request failed." That tells us nothing about which reasoning step went wrong, what the agent actually saw on that page, or why it thought its approach would work.
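The fix, at least in sketch form, is to capture context per step rather than per request: what page the agent saw (a hash plus an excerpt), what it intended to do, and what actually happened. The helper and field names below are hypothetical, not our production code.

```python
# Hypothetical sketch: per-step context capture so "request failed" becomes diagnosable.
import hashlib
import time


def run_step(index: int, intent: str, page_html: str, action, trace: list):
    """Execute one workflow step and record what the agent saw and intended.

    `action` is a zero-argument callable doing the real browser work;
    `page_html` is whatever the agent observed before acting.
    """
    entry = {
        "step": index,
        "intent": intent,                                                # the agent's stated plan
        "page_sha256": hashlib.sha256(page_html.encode()).hexdigest(),   # detects layout changes
        "page_bytes": len(page_html),
        "timestamp": time.time(),
    }
    try:
        entry["result"] = action()
        entry["status"] = "ok"
        return entry["result"]
    except Exception as exc:
        entry["status"] = "failed"
        entry["error"] = repr(exc)
        entry["page_excerpt"] = page_html[:2000]                         # keep evidence of what changed
        raise
    finally:
        trace.append(entry)
```

With that record, a failure on step 48 points at a page whose hash no longer matches yesterday's run, which is a very different starting point than a bare error code.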
You can't optimize what you can't see. You can't debug what you can't replay. You can't deploy with confidence what you can't monitor.
Reibman's work shows how young this field actually is. We're still building the basic infrastructure that software engineering took for granted decades ago: debuggers, profilers, monitoring tools. The difference is that agent systems aren't deterministic, so those tools can't simply be ported over. We're in the infrastructure-building phase of agent development, not the application-building phase.
By making agent behavior visible and debuggable, infrastructure like AgentOps transforms agent development from an exploratory process into an engineering discipline. It means you can run regression tests when you update prompts. It means you can identify which 3% of edge cases cause 80% of failures. It means you can deploy agents knowing that when something breaks, and it will break, you'll understand why within minutes, not days.
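A prompt-regression suite can start as nothing more than a set of golden cases replayed against the updated agent. The sketch below assumes a hypothetical `run_agent` entry point that returns structured JSON and placeholder fixture paths; the shape of the harness is the point, not the names.

```python
# Hypothetical sketch: golden-case regression tests to run after every prompt update.
import json

# Each case pins an input the current agent is known to handle correctly.
GOLDEN_CASES = [
    {"task": "verify the hotel booking", "page": "fixtures/booking_confirmed.html",
     "expect": {"status": "confirmed"}},
    {"task": "verify the hotel booking", "page": "fixtures/booking_cancelled.html",
     "expect": {"status": "cancelled"}},
]


def run_agent(task: str, page: str) -> dict:
    """Placeholder for the real agent entry point (LLM + tools), assumed to return a dict."""
    raise NotImplementedError


def test_prompt_update_does_not_regress():
    failures = []
    for case in GOLDEN_CASES:
        result = run_agent(case["task"], case["page"])
        for key, expected in case["expect"].items():
            if result.get(key) != expected:
                failures.append({"case": case, "field": key, "got": result.get(key)})
    # Surface every regression at once rather than stopping at the first mismatch.
    assert not failures, json.dumps(failures, indent=2, default=str)
```

Run it under pytest before and after a prompt change; the diff in failures is your regression report.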
For systems that need to work reliably at scale, this is table stakes. The step 47 problem isn't going away. But infrastructure that helps you understand and fix it separates demos from production.
Things to follow up on...
- Systematic failure patterns: Research from 2024-2025 shows that multi-agent systems face specific failure modes, including step repetition, premature terminations, incorrect verification, and state corruption from simultaneous updates.
- The verification challenge: Academic research reveals that weak verification mechanisms are a significant contributor to multi-agent system failures, and designing universal verification remains challenging even for experts.
- Open-source observability approach: Langfuse, founded during Y Combinator Winter 2023, provides open-source observability for LLM applications with agent tracing capabilities that can be self-hosted or used with their managed cloud version (a minimal usage sketch follows this list).
- Why traditional tools fail: The shift from linear software to probabilistic agent systems means traditional observability tools designed for simple prompt-response flows fall short when failures can occur at any point in an agentic workflow.
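On the Langfuse item above, here is a minimal tracing sketch. It assumes the v2-style Python SDK's `@observe` decorator with credentials supplied via `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` environment variables; the SDK has changed across major versions, so treat this as illustrative rather than authoritative.

```python
# Illustrative only: Langfuse v2-style decorator tracing; check current docs for the exact API.
from langfuse.decorators import observe


@observe()  # the top-level call becomes a trace
def verify_booking(confirmation_page: str) -> dict:
    return extract_confirmation(confirmation_page)


@observe()  # nested calls show up as spans inside that trace
def extract_confirmation(page: str) -> dict:
    # Stand-in for the real LLM/tool call.
    return {"status": "confirmed" if "Confirmed" in page else "unknown"}
```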

