Practitioner's Corner
Lessons from the field—what we see building at scale

What 98% Reliability Actually Costs

When a web agent platform reports 98% success rates across thousands of sites, that number looks clean on a dashboard. Behind it sits an on-call engineer staring at authentication failures, making a judgment call. Is this our proxy network or did these sites just rotate security certificates?
That distinction determines the next hour of work, and making it correctly requires pattern recognition you develop only by operating at scale. The cognitive load of maintaining high reliability accumulates in ways most organizations never see. What does 98% actually cost?
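To make the dashboard number concrete, here is a back-of-the-envelope sketch of what a 2% failure rate means in daily triage load. The run volume and per-failure triage time are illustrative assumptions, not figures from any particular platform:

```python
# Back-of-the-envelope cost of a 98% success rate.
# All inputs are illustrative assumptions, not platform figures.
def daily_failure_load(runs_per_day: int, success_rate: float,
                       triage_minutes_per_failure: float) -> dict:
    """Estimate how many failures land on the on-call engineer each day."""
    failures = runs_per_day * (1 - success_rate)
    triage_hours = failures * triage_minutes_per_failure / 60
    return {"failures_per_day": failures,
            "triage_hours_per_day": triage_hours}

load = daily_failure_load(runs_per_day=10_000, success_rate=0.98,
                          triage_minutes_per_failure=5)
# At these assumed volumes: ~200 failures/day, ~16.7 hours of triage
print(load)
```

At this assumed scale, a single on-call engineer cannot even keep pace; the 2% that the dashboard hides is a staffing problem, not a rounding error.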

When Your Auditor Doesn't Have a Category for Prompt Injection

An attacker needs only your email address to hijack your ChatGPT session, replace banking details in its responses, or extract data from Copilot. You never opened the malicious email; the attack happened anyway. Some vendors patched it. Others called it intended functionality.
Now your auditor arrives with a compliance checklist written before these threats existed. They're looking for evidence of input validation, least-privilege access, audit trails. You're trying to explain prompt injection. RAG poisoning. Citation manipulation. The categories don't match. The frameworks haven't caught up. But the audit happens regardless.

The Number That Matters

When your automated tests fail, how many failures are real? Anywhere from none to all of them.
False failure rates in test automation span the entire spectrum. Some test runs produce zero false positives. Others flag nothing but phantoms. You can't know which category you're in until someone manually investigates every red flag.
One team spent 23 hours triaging 300 test failures. Nearly three full workdays of an engineer checking whether each failure represented an actual defect or just a flaky test, a timing issue, a locator that drifted when someone updated the UI. As test suites grow from hundreds to thousands of cases, even modest false failure rates compound into days of weekly verification work.
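The compounding is easy to see in numbers. A minimal sketch, using the per-failure triage time implied by the example above (23 hours across 300 failures, roughly 4.6 minutes each); the suite sizes, run cadence, and false-failure rate are illustrative assumptions:

```python
# How false-failure triage cost scales with test suite size.
# Per-failure triage time is derived from the article's example
# (23 hours / 300 failures); other inputs are illustrative assumptions.
MINUTES_PER_TRIAGE = 23 * 60 / 300  # ~4.6 minutes per flagged failure

def weekly_triage_hours(suite_size: int, runs_per_week: int,
                        false_failure_rate: float) -> float:
    """Hours per week spent verifying failures that are not real defects."""
    false_failures = suite_size * runs_per_week * false_failure_rate
    return false_failures * MINUTES_PER_TRIAGE / 60

# Same modest 2% false-failure rate, growing suite:
for suite in (500, 2_000, 10_000):
    hours = weekly_triage_hours(suite, runs_per_week=5,
                                false_failure_rate=0.02)
    print(f"{suite:>6} tests -> {hours:5.1f} h/week of triage")
```

Under these assumptions, the rate never changes, yet the weekly verification burden grows from an afternoon to nearly two full engineer-weeks as the suite scales; the percentage stays flat while the cost does not.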