
Practitioner's Corner
Lessons from the field—what we see building at scale

What 98% Reliability Actually Costs

When a web agent platform reports 98% success rates across thousands of sites, that number looks clean on a dashboard. Behind it sits an on-call engineer staring at authentication failures, making a judgment call. Is this our proxy network or did these sites just rotate security certificates?
That distinction determines the next hour of work. It comes from pattern recognition you develop only by operating at scale. The cognitive load of maintaining high reliability accumulates in ways most organizations never see. What does 98% actually cost?
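That call can be half-automated. A minimal sketch, assuming a Python stack, a reachable direct path, and placeholder hostnames (none of this is any platform's real tooling): pull the site's certificate without the proxy, check how recently it was issued, and see whether the failure reproduces only on the proxied path.

```python
# Sketch of the on-call triage: did the site rotate its certificate,
# or is our proxy network the problem? Hostnames, ports, and thresholds
# are illustrative assumptions, not any platform's real tooling.
import socket
import ssl
from datetime import datetime, timezone

def cert_issued_at(host: str, port: int = 443) -> datetime:
    """Connect directly (no proxy) and return the certificate's notBefore time."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notBefore"]), tz=timezone.utc
    )

def classify_auth_failure(host: str, fails_direct: bool, fails_via_proxy: bool) -> str:
    """Rough first-pass call: site-side rotation vs. our proxy pool."""
    age_days = (datetime.now(timezone.utc) - cert_issued_at(host)).days
    if fails_via_proxy and not fails_direct:
        return "likely our side: only the proxied path fails"
    if fails_direct and age_days <= 2:
        return "likely site-side: certificate rotated within the last 48 hours"
    return "inconclusive: escalate and keep the incident open"

if __name__ == "__main__":
    # example.com stands in for whichever site started failing overnight
    print(classify_auth_failure("example.com", fails_direct=False, fails_via_proxy=True))
```

It doesn't replace the pattern recognition, but it turns the first ten minutes of the incident into a script.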

When Your Auditor Doesn't Have a Category for Prompt Injection

An attacker needs only your email address to hijack your ChatGPT session, replace banking details in its responses, or extract data from Copilot. You never opened the malicious email. The attack happened anyway. Some vendors patched it. Others called it intended functionality.
Now your auditor arrives with a compliance checklist written before these threats existed. They're looking for evidence of input validation, least-privilege access, audit trails. You're trying to explain prompt injection. RAG poisoning. Citation manipulation. The categories don't match. The frameworks haven't caught up. But the audit happens regardless.
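One stopgap that helps the conversation, sketched below as a pure illustration rather than any framework's guidance: keep a crosswalk from the new threat categories to the closest control family the checklist already recognizes, so each piece of evidence has a row the auditor can file it under. The mappings and evidence names here are assumptions for the example.

```python
# Illustrative crosswalk from AI-era threats to the control families a
# pre-LLM audit checklist already recognizes. The mappings and evidence
# names are assumptions for discussion, not a compliance framework.
THREAT_TO_CONTROL = {
    "prompt injection": {
        "closest_control": "input validation",
        "evidence": ["input filtering rules", "red-team prompt logs"],
    },
    "rag poisoning": {
        "closest_control": "data integrity / supply chain",
        "evidence": ["source allowlists", "corpus change review records"],
    },
    "citation manipulation": {
        "closest_control": "output verification / audit trails",
        "evidence": ["citation resolution checks", "response provenance logs"],
    },
}

def audit_row(threat: str) -> str:
    """Render one line the auditor can actually file."""
    entry = THREAT_TO_CONTROL[threat]
    return f"{threat} -> {entry['closest_control']} (evidence: {', '.join(entry['evidence'])})"

if __name__ == "__main__":
    for threat in THREAT_TO_CONTROL:
        print(audit_row(threat))
```

It doesn't make the framework current, but it gives the audit somewhere to land.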

The Number That Matters

When your automated tests fail, how many failures are real? Anywhere from none to all of them.
False failure rates in test automation span the entire spectrum. Some test runs produce zero false positives. Others flag nothing but phantoms. You can't know which category you're in until someone manually investigates every red flag.
One team spent 23 hours triaging 300 test failures. Nearly three full workdays of an engineer checking whether each failure represented an actual defect or just a flaky test, a timing issue, a locator that drifted when someone updated the UI. As test suites grow from hundreds to thousands of cases, even modest false failure rates compound into days of weekly verification work.
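The back-of-the-envelope math, using the per-failure time implied by that 23-hours-for-300-failures figure; the suite sizes and the 10% rate below are hypothetical.

```python
# Back-of-the-envelope: what a given false failure rate costs in triage time.
# The ~4.6 minutes per failure comes from the 23 hours / 300 failures figure
# above; suite sizes and the 10% rate are hypothetical.
MINUTES_PER_FAILURE = 23 * 60 / 300  # = 4.6 minutes of manual investigation each

def weekly_triage_hours(tests_per_day: int, false_failure_rate: float, days: int = 5) -> float:
    """Hours per week spent confirming failures that were never real defects."""
    phantom_failures = tests_per_day * false_failure_rate * days
    return phantom_failures * MINUTES_PER_FAILURE / 60

if __name__ == "__main__":
    for suite in (500, 1_000, 2_000):
        hours = weekly_triage_hours(suite, false_failure_rate=0.10)
        print(f"{suite:>5} tests/day at 10% false failures -> {hours:5.1f} hours/week")
```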
When page loads lag behind test execution, scripts throw timeout exceptions against perfectly functional applications: the classic false failure rooted in script-browser timing issues (a wait-pattern sketch follows these notes).
False failures don't just waste time. They undermine trust in automation itself, causing teams to ignore or deprioritize test results entirely.
Defect prediction tools reduced one team's investigation time by 70%, from 23 hours to 7 hours for the same 300 failures.
Testing environment inconsistencies, locator changes from UI updates, and network variability all contribute to false positives requiring manual investigation.
A 10% false failure rate sounds manageable until you're running 5,000 tests daily and suddenly triaging 500 phantom failures every morning.
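The timing class of false failure has a well-worn mitigation: wait on the condition the test actually needs rather than on a fixed delay. A minimal sketch with Selenium's explicit waits, where the URL, locator, and 30-second timeout are placeholders; most browser automation stacks have an equivalent.

```python
# Sketch: replace fixed sleeps with an explicit wait on the condition the
# test actually needs, so a slow page load is not reported as a defect.
# URL, locator, and timeout values are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/login")  # placeholder URL
    wait = WebDriverWait(driver, timeout=30)  # tolerate slow loads up to 30 s
    # Fail only if the element never becomes clickable, not because the page
    # was merely slower than an arbitrary sleep.
    submit = wait.until(EC.element_to_be_clickable((By.ID, "submit")))  # placeholder locator
    submit.click()
finally:
    driver.quit()
```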

Field Notes from the Ecosystem

Cloudflare went down twice in three weeks. Both times from routine config changes. The second outage happened because the fixes from the first one weren't deployed yet. That's infrastructure in December.
Enterprises keep discovering gaps between what they're buying and what they trust: only 6% fully trust AI agents with core work, and only 20% say their infrastructure is ready. Meanwhile, 40% of enterprise LLM spend shifted to Anthropic this year, and OpenAI declared a code red.
API traffic now dominates the web at 80%. Rate limiters can't keep up. Here's what we're tracking.

Practitioner Resources


