Kit Voss Debugs Systems That Don't Know They're Broken

May 20, 2026

Kit Voss is not a real person, which is the only way we could get someone in this role to speak this candidly. But every detail here — the failure modes, the workarounds, the 2 AM trace archaeology — is drawn from published practitioner accounts and production data.¹²³ Kit is a Staff AI Platform Engineer at a B2B data enrichment company, responsible for roughly 80 million agent executions per month.

You came from traditional SRE. What surprised you about this role?

Kit: When a REST API breaks, it has the decency to tell you. You get a 500, a timeout, a stack trace. The whole system was designed around the assumption that failures are loud.² Agents are polite about it. An agent can run your entire pipeline, hit every step, return a beautifully formatted result, and the result is wrong. No error code. No signal. The dashboard stays green. I spent three hours last month debugging a pipeline that wasn't broken. No errors, no exceptions, logs looked fine. The output just happened to be completely fabricated.¹

What does "fabricated" look like in your system?

Kit: We run enrichment agents. Multi-step research workflows that build company profiles from dozens of sources. The specific failure that still keeps me up: an agent pulled data for two different companies with similar names and blended them into one profile. Revenue from Company A, headcount from Company B, a headquarters address that was actually a WeWork in a city neither company operates in.

The profile looked great. Confident, detailed, well-structured. A customer put it in a pitch deck. That's how we found out. The customer's prospect said "this isn't us."

How long between the bad run and the discovery?

Kit: Eleven days.

What does your monitoring stack actually catch?

Kit: Infrastructure. Latency, token usage, error rates, rate limits. Rate limits alone account for almost 60% of our actual errors.⁴ I have beautiful dashboards for all of that. What I don't have is a dashboard that says "this agent confidently returned the wrong company." Because that registers as a 200 OK. The most dangerous response in production.²

So I built something. I'm a little embarrassed by it. A Slack bot. It watches our downstream deduplication pipeline. When dedup rates spike, it pings me, because a spike usually means agents are creating near-duplicate records, which usually means they're blending entities. It's a canary. It's also held together with a cron job and optimism.
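For readers who want the canary in concrete terms, here is a minimal sketch of the spike check only. It is not Kit's actual bot. The function name, the seven-point minimum, and the z-score threshold are all assumptions chosen for illustration, and the Slack notification is left out entirely.

```python
from statistics import mean, stdev

def is_dedup_spike(history, current, min_points=7, z_threshold=3.0):
    """Return True when `current` sits far above the recent baseline.

    history: recent dedup rates, each the fraction of records merged
             as duplicates in one window (hypothetical metric).
    current: the dedup rate for the newest window.
    """
    if len(history) < min_points:
        return False  # not enough baseline to judge either way
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat baseline: any increase counts as a spike.
        return current > baseline
    return (current - baseline) / spread > z_threshold

# Example: a week of stable rates, then one window that doubles.
recent = [0.020, 0.021, 0.019, 0.022, 0.020, 0.021, 0.020]
if is_dedup_spike(recent, 0.045):
    print("dedup rate spike: check for blended entities")
```

A check like this only says that something downstream changed. It cannot say which runs blended entities, which is why it works as a canary rather than a measurement.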

You're inferring data quality from a side effect in a different pipeline.

Kit: Because no observability tool I've found checks the outcome. They all check the process. Did the agent run? Yes. Did it complete? Yes. Did the LLM return a response? Yes. Did the tools get called? Yes. Was the result correct? Shrug emoji. That question lives in a different system entirely. If it lives anywhere.³

When leadership asks for the success rate, what do you say?

Kit: I say 96.2%, and then I feel like a liar. That number comes from the agent's self-reported completion status. The agent says "done," we count it as a success. The actual error rate, meaning how often the output is materially wrong, I genuinely don't know. Could be 4%. Could be 12%. I sample about 200 traces a week manually, which against 80 million runs a month is... I'll let you do the math on that coverage.

The thing I can't explain to my VP without sounding like I'm undermining my own team: our success metric measures whether the agent thinks it succeeded.⁵
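The math Kit leaves as an exercise can be sketched in a few lines. The only assumption beyond the figures quoted in the interview is the conversion of weeks to months (52 / 12).

```python
RUNS_PER_MONTH = 80_000_000   # from the interview
SAMPLES_PER_WEEK = 200        # from the interview
WEEKS_PER_MONTH = 52 / 12     # assumption: average month length

samples_per_month = SAMPLES_PER_WEEK * WEEKS_PER_MONTH
coverage = samples_per_month / RUNS_PER_MONTH

print(f"{samples_per_month:.0f} traces of {RUNS_PER_MONTH:,} runs = {coverage:.5%}")
# prints: 867 traces of 80,000,000 runs = 0.00108%
```

That is roughly one reviewed trace for every 92,000 runs.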

What does manual trace review actually look like?

Kit: You pull up a trace, and if you're lucky, it's got full tool call arguments, the retrieved context snapshots, the prompt template version, all of it. If you're unlucky, someone decided tool arguments were too sensitive to log, and now you're reconstructing what happened from the output alone. Debugging by speculation.

But even with full traces, the hard part is the causal chain. You see a bad output at step 8. You trace it back. The actual failure was at step 3. The agent chose interpretation A when the context required interpretation B, then executed five more steps confidently on the wrong foundation.¹ The stack trace points nowhere useful. You're doing archaeology.

And the retries mask it. We retry on failure, obviously. But sometimes the first call returns a 200 with garbage. The retry logic sees "success" and moves on. So you end up with a system that has, let's say, a 50% real error rate presenting as 95% success, because retries hide the failures and the failures that get through don't look like failures.²
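One way to keep retries from hiding a garbage 200 is to make the retry loop check the payload as well as the status code, and to record what each attempt actually was. The sketch below assumes a hypothetical `call` function returning a `(status_code, body)` pair and a caller-supplied `is_valid` check. Neither comes from Kit's system.

```python
import time

def call_with_validation(call, is_valid, max_attempts=3, backoff_s=0.0):
    """Retry `call` until it returns a 200 whose body passes `is_valid`.

    call:     zero-argument function returning (status_code, body).
    is_valid: function(body) -> bool, a domain check on the payload.
    Returns (body, attempts); body is None if every attempt failed.
    """
    attempts = []
    for attempt in range(1, max_attempts + 1):
        status, body = call()
        if status != 200:
            outcome = "transport_error"
        elif not is_valid(body):
            # A 200 that status-only retry logic would count as success.
            outcome = "invalid_payload"
        else:
            outcome = "ok"
        attempts.append(outcome)
        if outcome == "ok":
            return body, attempts
        if attempt < max_attempts:
            time.sleep(backoff_s * attempt)  # simple linear backoff
    return None, attempts

# Example with canned responses standing in for a real tool call.
responses = iter([(200, {"revenue": None}), (500, None), (200, {"revenue": 1_200_000})])
has_revenue = lambda body: isinstance(body.get("revenue"), (int, float))
body, attempts = call_with_validation(lambda: next(responses), has_revenue)
print(attempts)
```

Keeping the per-attempt log matters as much as the validation itself. Counting `invalid_payload` attempts separately is what lets a dashboard show the error rate that retries would otherwise absorb. A payload check like this still only catches malformed output, not a well-formed wrong answer.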

What would adequate instrumentation look like?

Kit: I've thought about this more than is probably healthy. Run every agent twice. Different identity, different session. Diff the outputs. If two independent runs produce materially different answers for the same query, something's wrong. The technology isn't hard. The hard part is distinguishing a real divergence from normal model non-determinism. Two runs returning slightly different phrasing is fine. Two runs returning different revenue numbers is not. I don't have a clean rule for where that boundary sits. Nobody does, as far as I can tell.
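The run-twice-and-diff idea can be sketched as a field-level comparison that ignores phrasing and flags numbers that disagree. Everything specific here is an assumption: the flat dictionary shape, the field names, the 5% numeric tolerance, and the choice to skip free-text fields. The tolerance in particular is exactly the boundary Kit says has no clean rule, so treat it as a placeholder.

```python
def _normalize(value):
    """Collapse case and whitespace so phrasing noise is not a diff."""
    if isinstance(value, str):
        return " ".join(value.lower().split())
    return value

def material_diffs(run_a, run_b, numeric_tolerance=0.05, ignore=("summary",)):
    """Return fields where two independent runs materially disagree.

    run_a, run_b: flat dicts of field name -> value (hypothetical shape).
    numeric_tolerance: relative difference allowed between numbers.
    ignore: free-text fields where differing wording is expected.
    """
    diffs = {}
    for field in sorted(set(run_a) | set(run_b)):
        if field in ignore:
            continue
        a, b = run_a.get(field), run_b.get(field)
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            scale = max(abs(a), abs(b))
            if scale and abs(a - b) / scale > numeric_tolerance:
                diffs[field] = (a, b)
        elif _normalize(a) != _normalize(b):
            diffs[field] = (a, b)
    return diffs

# Example: same company, two runs, one disagreement that matters.
first = {"name": "Acme Corp", "revenue": 12_000_000, "headcount": 140,
         "summary": "Acme makes widgets."}
second = {"name": "acme  corp", "revenue": 48_500_000, "headcount": 142,
          "summary": "A widget maker."}
print(material_diffs(first, second))
# prints: {'revenue': (12000000, 48500000)}
```

Agreement between two runs is not proof of correctness, since both can make the same mistake. Disagreement on a numeric field is simply a cheap signal that one of them needs a human look.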

The Datadog report says 69% of companies now use three or more models. Does model churn affect you?

Kit: Every new model is a new regression surface. We onboarded one last quarter. Better benchmarks, lower cost, everyone was excited. Two weeks in, I noticed our downstream correction rate had crept up 3%. Not enough to trigger any alert. Just enough to make me suspicious. Turned out the new model was more aggressive about inferring missing data rather than flagging it as unavailable. Technically a style difference. Operationally, it meant we were shipping guesses as facts.⁴
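The shift Kit describes, from flagging a field as unavailable to inferring it, is only visible if the output records which one happened. The sketch below assumes a hypothetical profile format in which every field carries a `provenance` tag of `observed`, `inferred`, or `unavailable`. Nothing in the interview says Kit's agents emit such a tag.

```python
from collections import Counter

KINDS = ("observed", "inferred", "unavailable")

def provenance_share(profiles):
    """Return the share of fields per provenance kind across a batch.

    profiles: list of dicts mapping field name ->
              {"value": ..., "provenance": one of KINDS} (hypothetical schema).
    """
    counts = Counter(
        field["provenance"]
        for profile in profiles
        for field in profile.values()
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: counts[kind] / total for kind in KINDS}

# Example: the same record as produced by two models.
old_model = [{"revenue": {"value": None, "provenance": "unavailable"},
              "hq": {"value": "Austin", "provenance": "observed"}}]
new_model = [{"revenue": {"value": 5_000_000, "provenance": "inferred"},
              "hq": {"value": "Austin", "provenance": "observed"}}]
print(provenance_share(old_model))
print(provenance_share(new_model))
```

Comparing the `inferred` share before and after a model swap would surface this kind of change as a number, provided the model reports its own provenance honestly, which is itself an assumption.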

What would you want people outside this role to understand?

Kit: That the gap between "the agent ran" and "the agent was right" is where I live. And right now, almost nobody's building tools for that gap. We instrument the process beautifully. We don't instrument the truth at all.

Kit Voss is fictional. Everything Kit described is happening in production systems right now.

1. Carson Roell, "Why AI Agents Fail Silently — And How to Fix It," Dev.to, March 28, 2026. https://dev.to/carsonroelldebug/why-ai-agents-fail-silently-and-how-to-fix-it-j6d
2. Runcycles.io, "AI Agent Silent Failures: Why 200 OK Is the Most Dangerous Response in Production," March 26, 2026. https://runcycles.io/blog/ai-agent-silent-failures-why-200-ok-is-the-most-dangerous-response
3. Latitude, "Detecting AI Agent Failure Modes in Production," March 26, 2026. https://latitude.so/blog/ai-agent-failure-detection-guide
4. Datadog, "State of AI Engineering 2026." https://www.datadoghq.com/state-of-ai-engineering/
5. "5 Silent Failure Patterns I Keep Finding in Production AI Systems," Dev.to, May 2026. https://dev.to/temurkhan13/5-silent-failure-patterns-i-keep-finding-in-production-ai-systems-4fl0
