Practitioner's Corner
Lessons from the field—what we see building at scale

What 98% Reliability Actually Costs

When a web agent platform reports 98% success rates across thousands of sites, that number looks clean on a dashboard. Behind it sits an on-call engineer staring at authentication failures, making a judgment call. Is this our proxy network or did these sites just rotate security certificates?
That distinction determines the next hour of work, and making it correctly requires pattern recognition you develop only by operating at scale. The cognitive load of maintaining high reliability accumulates in ways most organizations never see. What does 98% actually cost?
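To make the dashboard number concrete, here is a back-of-the-envelope sketch of what a 2% failure rate means in daily triage load. The run volume and per-failure triage time are illustrative assumptions, not figures from any particular platform:

```python
# Back-of-the-envelope cost of a 98% success rate.
# All inputs are illustrative assumptions, not platform figures.
def daily_failure_load(runs_per_day: int, success_rate: float,
                       triage_minutes_per_failure: float) -> dict:
    """Estimate how many failures land on the on-call engineer each day."""
    failures = runs_per_day * (1 - success_rate)
    triage_hours = failures * triage_minutes_per_failure / 60
    return {"failures_per_day": failures,
            "triage_hours_per_day": triage_hours}

load = daily_failure_load(runs_per_day=10_000, success_rate=0.98,
                          triage_minutes_per_failure=5)
# At these assumed volumes: ~200 failures/day, ~16.7 hours of triage
print(load)
```

At this assumed scale, a single on-call engineer cannot even keep pace; the 2% that the dashboard hides is a staffing problem, not a rounding error.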

When Your Auditor Doesn't Have a Category for Prompt Injection

An attacker needs only your email address to hijack your ChatGPT session, replace banking details in its responses, or extract data from Copilot. You never opened the malicious email; the attack happened anyway. Some vendors patched it. Others called it intended functionality.
Now your auditor arrives with a compliance checklist written before these threats existed. They're looking for evidence of input validation, least-privilege access, audit trails. You're trying to explain prompt injection. RAG poisoning. Citation manipulation. The categories don't match. The frameworks haven't caught up. But the audit happens regardless.

The Number That Matters

When your automated tests fail, how many failures are real? Anywhere from none to all of them.
False failure rates in test automation span the entire spectrum. Some test runs produce zero false positives. Others flag nothing but phantoms. You can't know which category you're in until someone manually investigates every red flag.
One team spent 23 hours triaging 300 test failures. Nearly three full workdays of an engineer checking whether each failure represented an actual defect or just a flaky test, a timing issue, a locator that drifted when someone updated the UI. As test suites grow from hundreds to thousands of cases, even modest false failure rates compound into days of weekly verification work.
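The compounding is easy to see in numbers. A minimal sketch, using the per-failure triage time implied by the example above (23 hours across 300 failures, roughly 4.6 minutes each); the suite sizes, run cadence, and false-failure rate are illustrative assumptions:

```python
# How false-failure triage cost scales with test suite size.
# Per-failure triage time is derived from the article's example
# (23 hours / 300 failures); other inputs are illustrative assumptions.
MINUTES_PER_TRIAGE = 23 * 60 / 300  # ~4.6 minutes per flagged failure

def weekly_triage_hours(suite_size: int, runs_per_week: int,
                        false_failure_rate: float) -> float:
    """Hours per week spent verifying failures that are not real defects."""
    false_failures = suite_size * runs_per_week * false_failure_rate
    return false_failures * MINUTES_PER_TRIAGE / 60

# Same modest 2% false-failure rate, growing suite:
for suite in (500, 2_000, 10_000):
    hours = weekly_triage_hours(suite, runs_per_week=5,
                                false_failure_rate=0.02)
    print(f"{suite:>6} tests -> {hours:5.1f} h/week of triage")
```

Under these assumptions, the rate never changes, yet the weekly verification burden grows from an afternoon to nearly two full engineer-weeks as the suite scales; the percentage stays flat while the cost does not.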