Dag Failover is, by his own admission, a name that has caused problems at every company he's ever worked for. "My first week at Google, three people asked if it was a handle," he says. It is not. It is, apparently, a corruption of a Norwegian surname that his great-grandfather decided not to correct at Ellis Island because he thought it sounded "more American."1
Dag spent seven years building and maintaining microservice infrastructure before landing at a mid-size Pacific Northwest logistics company we'll call PalletStream, where he led platform reliability. In early 2025, PalletStream began deploying LLM-based agent workflows for procurement operations. Dag was the obvious person to own reliability for the new systems. His SRE instincts, he says, served him well for approximately eleven days.
You came to agent systems with a deep SRE background. What did you expect?
Dag: I thought it would be boring in a good way. I had my playbook: define SLIs, set SLOs, establish error budgets, instrument everything, iterate. I'd done this for forty-odd services. The procurement workflow was twelve steps touching five systems. That's not even complicated by microservice standards. I figured the hard part would be getting the ML team to agree on metric definitions.
Was it?
Dag: The hard part was that my entire monitoring philosophy assumed failures would raise their hand and introduce themselves. In microservices, a bad database call throws a 500. A timeout surfaces in your latency distribution. The signal is the failure. I had years of muscle memory built around that.
The procurement agent didn't work that way. It would query the wrong vendor catalog, select a perfectly valid SKU, generate a clean purchase order, and route it for approval. Every step: 200 OK. My dashboards were green. Gorgeous, actually. I had really nice Grafana panels.
When did you realize the dashboards were telling the truth about the wrong thing?
Dag: About three weeks in. Someone in procurement operations, a woman named Janet who'd been doing this work manually for eight years, flagged that we were ordering from a vendor we'd deprioritized six months ago. Not a huge deal on its own. But she pulled the thread and found eleven similar orders over two weeks.
I went back to the logs. Every single one had a clean execution trace. No errors, no retries, no anomalies. The agent had succeeded at placing incorrect orders. And I realized I had no instrument that could have caught that. My SLIs measured request success, latency, token consumption. I had a correctness SLO on paper, and the Google SRE Workbook actually names correctness as a valid SLO type2, but I'd defined it as "did the output conform to the expected schema." Which it did. Perfectly. Every time.
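The failure mode Dag describes is easy to sketch. Below is a hypothetical version of a correctness SLO defined as "did the output conform to the expected schema" (the field names, vendor IDs, and values are invented for illustration): it cheerfully passes on an order that is semantically wrong.

```python
# Hypothetical sketch: a "correctness" check defined as schema conformance.
# Every name and value here is invented; the point is that a semantically
# wrong order passes the check perfectly.

REQUIRED_FIELDS = {"vendor_id", "sku", "quantity", "unit_price"}

def conforms_to_schema(order: dict) -> bool:
    """The SLO as originally defined: does the output match the expected shape?"""
    return (
        set(order) == REQUIRED_FIELDS
        and isinstance(order["quantity"], int)
        and order["quantity"] > 0
        and order["unit_price"] > 0
    )

# A purchase order from a vendor deprioritized six months ago:
# perfectly valid shape, wrong business outcome.
bad_order = {"vendor_id": "VND-0417", "sku": "PLT-88", "quantity": 40, "unit_price": 12.5}

print(conforms_to_schema(bad_order))  # True: green dashboard, wrong outcome
```

No amount of tightening this check catches the Janet case, because "we stopped using that vendor six months ago" is not a property of the payload's shape.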
How does that compare to the cascading failures you'd dealt with before?
Dag: In microservices, cascading failure is operational. A slow dependency causes queue buildup, causes timeouts, causes retries, amplifies load. You can see it happening. You have circuit breakers.
Agent cascading failure is semantic. A misclassification in step three doesn't trip a circuit breaker. It propagates as valid state into step four, which processes it faithfully, passes it to step five. The system is running at full speed in the wrong direction. By the time it surfaces, if it surfaces, the causal chain is buried across a dozen tool calls and three different systems. You know those old cartoons where the character runs off a cliff and keeps going until they look down? Except in this version, nobody looks down.
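A toy version of that semantic cascade, with entirely made-up steps and vendor IDs: one misclassification early on, and every downstream step processes it faithfully. No exceptions, no retries, nothing for a circuit breaker to trip on.

```python
# Hypothetical three-step fragment of a pipeline. Step 3 misclassifies;
# steps 4 and 5 succeed at processing the wrong state. Nothing raises.

def classify_request(text: str) -> str:
    # Step 3: misclassifies a restock request as a new-vendor request.
    return "new_vendor"  # wrong, but a perfectly valid label

def select_catalog(category: str) -> str:
    # Step 4: faithfully routes based on the (wrong) label.
    return {"restock": "approved_vendors", "new_vendor": "full_catalog"}[category]

def pick_vendor(catalog: str) -> str:
    # Step 5: picks a real vendor from the (wrong) catalog.
    return "VND-0417" if catalog == "full_catalog" else "VND-0001"

vendor = pick_vendor(select_catalog(classify_request("restock pallets")))
print(vendor)  # VND-0417: a deprioritized vendor, reached via three clean successes
```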
So you added human checkpoints.
Dag: Three of them. After vendor selection, after pricing confirmation, before final PO submission. End-to-end success rate went from depressing to genuinely good. Better than the manual process had been on accuracy.
That sounds like the story should end there.
Dag: It does, until you run the cost model. Per-transaction cost with the human checkpoints was higher than the fully manual process we'd replaced. Not by a lot, but enough that my VP asked a question I did not enjoy answering: "So we spent eight months building an automated system that costs more than the people it was supposed to replace?"
I tried to argue it was transitional, that we'd improve per-step reliability and eventually remove the checkpoints. And that might be true. But I've been doing reliability engineering long enough to know the shape of that curve. The SRE literature calls it the "march of nines": each additional nine of reliability costs as much as reaching the previous one.3 Getting from 95% to 99% per step is real engineering work. Getting from 99% to 99.9% is a different magnitude entirely. And the compounding math means you need those nines at every step simultaneously.
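The compounding math Dag is gesturing at is simple to state: if each step succeeds independently with probability p, a twelve-step chain succeeds with probability p to the twelfth power. A quick sketch (the numbers match the ~54% figure cited in the footnotes):

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability every step in the chain succeeds, assuming independence."""
    return per_step ** steps

for p in (0.95, 0.99, 0.999):
    print(f"{p} per step -> {end_to_end_success(p, 12):.3f} over 12 steps")
# 0.95 per step yields roughly 0.54 end to end; even 0.99 yields only ~0.89.
```

Which is why "98% step-level success" and "depressing end-to-end success" can both be true of the same workflow.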
So now I think about it differently. The human checkpoints aren't scaffolding. They're infrastructure. They handle the failure mode my monitoring can't see.
That must be uncomfortable for someone whose career was built on eliminating manual work.
Dag: [long pause] Yeah. SRE culture has a concept called "toil," repetitive manual work that scales linearly and should be automated.4 My entire professional identity was built around eliminating toil. And now I'm arguing that certain kinds of human judgment in automated pipelines aren't toil at all. They're load-bearing.
The thing that keeps me up at night is that Janet, the procurement person who caught those bad orders, was on a list for the next round of headcount optimization. Because our dashboards showed the agent workflow running at 98% success. Which it was, if you measured step-level success. She was almost eliminated because the metric that would have justified keeping her didn't exist yet.
What metric should have existed?
Dag: End-to-end outcome correctness, verified against business intent. Which is a mouthful and also basically impossible to fully automate, because "business intent" isn't a schema you can validate against. It's Janet knowing that we stopped using that vendor six months ago.
I think the honest answer is that we need SLOs for things that don't look like errors. I don't fully know how to build those yet. But I know that measuring step-level success and calling it reliability is malpractice. It's the equivalent of measuring each individual microservice's uptime and concluding the user experience is fine.
What are you building toward?
Dag: Shorter chains, for one. We decomposed the twelve-step workflow into three segments of four steps each, with hard verification boundaries between them. That alone changed the math dramatically. And I'm trying to build what I'm calling "semantic SLOs," objectives defined not by "did the request succeed" but "did the outcome match what a domain expert would have done." Which requires domain experts in the loop, at least for calibration.
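The segment math can be sketched under some stated assumptions: 95% per-step success, and a verification boundary that catches and corrects a bad segment with some probability. The 0.9 catch rate below is invented for illustration, not a number from PalletStream.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Unverified chain: every step must succeed."""
    return per_step ** steps

def segmented_success(per_step: float, seg_len: int, segments: int, catch_rate: float) -> float:
    """Each segment either runs clean, or its verification boundary
    catches and corrects the failure with probability catch_rate."""
    seg_ok = per_step ** seg_len
    seg_verified = seg_ok + (1 - seg_ok) * catch_rate
    return seg_verified ** segments

monolith = chain_success(0.95, 12)               # ~0.54
with_boundaries = segmented_success(0.95, 4, 3, 0.9)  # ~0.95 with a 90% catch rate
print(f"12-step chain: {monolith:.2f}, 3x4 with verification: {with_boundaries:.2f}")
```

The verification boundaries do the heavy lifting here, which is exactly the "checkpoints as infrastructure" argument: the gain comes from catching semantic failures at the seams, not from making the steps themselves more reliable.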
But honestly, I think the industry is going to learn this the hard way. Everyone's measuring step-level metrics and reporting great numbers. The end-to-end outcome rate is the number nobody's looking at. And by the time they look, they'll have already laid off Janet.
Dag Failover's views are his own and do not represent PalletStream, which also does not exist, because this is a composite character built from operational patterns that are, unfortunately, very real.5
Footnotes

1. We were unable to verify this claim, but we were also unable to disprove it, and it was too good not to include.
2. Google SRE Workbook, "Implementing SLOs": correctness is listed alongside availability, latency, freshness, and durability as valid SLO types. https://sre.google/workbook/implementing-slos/
3. The "march of nines" concept is well documented in SRE literature: each additional nine of reliability requires engineering effort equivalent to reaching the previous level. See Mandava, "The Agentic AI Infrastructure Landscape in 2025–2026," https://medium.com/@vinniesmandava/the-agentic-ai-infrastructure-landscape-in-2025-2026-a-strategic-analysis-for-tool-builders-b0da8368aee2
4. Google SRE Workbook defines toil as "the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows." https://sre.google/workbook/eliminating-toil/
5. The compound reliability patterns described here (95% per-step reliability degrading to ~54% over twelve steps; 79% of multi-agent failures stemming from specification and coordination issues; and the observability paradox where 94% of organizations have tracing but quality remains the top barrier) are drawn from published evaluations and industry surveys. See LangChain, "State of Agent Engineering," https://www.langchain.com/state-of-agent-engineering; Maxim AI, "Ensuring AI Agent Reliability in Production," https://www.getmaxim.ai/articles/ensuring-ai-agent-reliability-in-production/
