Flaky Tests and the Prehistory of Agent Unreliability

A decade of coping with flaky browser tests built infrastructure for environmental chaos. Agents reintroduce non-determinism where those tools can't reach.

By Rina Takahashi, May 21, 2026

In 2016, Google published a finding that should have been embarrassing: 84% of pass-to-fail transitions in their test suites weren't real bugs. The test passed yesterday, failed today, would pass again tomorrow. Nothing in the code had changed. The environment had simply shifted underneath it.

The assumption behind those test suites was reasonable: a deterministic script, given the same inputs, should produce the same outputs. Web browsers made that assumption wrong. And the decade software engineers spent coping with that wrongness turns out to be a useful prehistory for a problem that's about to get harder.

When Selenium 2.0 shipped in 2011, it introduced wait mechanisms as formal tools because web pages don't load instantaneously. A test that clicks a button before the button renders will fail for reasons that have nothing to do with the button. Implicit waits polled the page on a timer. Explicit waits watched for specific conditions, like an element becoming visible or clickable. Both tried to synchronize a deterministic script with an environment that refused to hold still. They helped. They also introduced new failure modes. Selenium's own documentation warned against mixing the two types because combining them produced unpredictable, compounding timeouts. The cure generating its own disease, as it does.
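To make the explicit-wait idea concrete, here is a minimal sketch in plain Python. It is not Selenium's implementation: `wait_until`, `WaitTimeout`, and `FakePage` are hypothetical names invented for illustration, and the "page" is just a timer. The shape is the point: poll a specific condition until it holds or a deadline passes.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class WaitTimeout(Exception):
    """Raised when the condition never became truthy before the deadline."""


def wait_until(condition: Callable[[], T], timeout: float = 5.0,
               poll_interval: float = 0.05) -> T:
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout}s")
        time.sleep(poll_interval)


class FakePage:
    """Hypothetical stand-in for a page whose button renders after a delay."""

    def __init__(self, render_delay: float) -> None:
        self._ready_at = time.monotonic() + render_delay

    def button_visible(self) -> bool:
        return time.monotonic() >= self._ready_at


page = FakePage(render_delay=0.2)
assert not page.button_visible()               # clicking now would fail
wait_until(page.button_visible, timeout=2.0)   # block until the button exists
assert page.button_visible()                   # now the click is safe
```

Note what the sketch takes for granted: the caller already knows which condition to wait for. The script's intent is fixed, and only the timing is uncertain.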

By 2018, the community had identified another culprit: brittle locators. Tests that found page elements by CSS class or position broke whenever a designer reorganized a layout. React Testing Library popularized stable hooks, like data-testid attributes, that survived visual changes. Give the test its own addressing system instead of asking it to navigate a surface designed for human eyes.
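A toy illustration of why the dedicated hook survives, using only Python's standard-library HTML parser. The markup, the class names, and the `find` helper are all made up for this sketch; real test frameworks have their own locator APIs.

```python
from html.parser import HTMLParser
from typing import Optional


class AttrFinder(HTMLParser):
    """Records the tag of the first element whose attribute holds a given token."""

    def __init__(self, name: str, value: str) -> None:
        super().__init__()
        self.name = name
        self.value = value
        self.found: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if self.found is not None:
            return
        for key, val in attrs:
            # Treat the attribute as space-separated tokens, as with `class`.
            if key == self.name and val is not None and self.value in val.split():
                self.found = tag


def find(html: str, name: str, value: str) -> Optional[str]:
    """Return the tag name of the first matching element, or None."""
    finder = AttrFinder(name, value)
    finder.feed(html)
    return finder.found


# Hypothetical markup before and after a redesign.
BEFORE = '<div class="sidebar"><button class="btn-primary" data-testid="checkout">Buy</button></div>'
AFTER = '<footer class="sticky-bar"><button class="cta-large" data-testid="checkout">Buy</button></footer>'

assert find(BEFORE, "class", "btn-primary") == "button"
assert find(AFTER, "class", "btn-primary") is None          # styling locator broke
assert find(AFTER, "data-testid", "checkout") == "button"   # test hook survived
```

The designer renamed the class and moved the button, and only the locator tied to presentation noticed.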

Then Microsoft released Playwright in January 2020, and its core design choice was revealing. Auto-waiting ran through the entire architecture. Every action verified visibility, stability, and interactability before executing. The framework absorbed the timing problem into its foundation rather than handing developers tools to manage it themselves. Containerized test environments addressed yet another layer: shared state between runs, dependency drift, the slow accumulation of environmental grime.

Each layer targeted a specific source of chaos. Waits for timing. Stable locators for structural fragility. Auto-waiting for the synchronization gap. Containers for environmental drift. Each layer genuinely reduced the problem. And notice the shared assumption underneath all of them: the script knows what it wants to do. The non-determinism is in the world around it.

AI agents break that assumption cleanly.

The τ-bench benchmark measures what happens when you give the same agent the same task repeatedly under identical conditions. GPT-4o on a retail task: 61% single-run success. Run it eight times and ask whether it succeeded every time? Roughly 25%. That gap has nothing to do with environmental noise. The model's own reasoning varies, the planning layer choosing different paths through the same problem on each attempt.
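The "succeeded every time" number can be estimated from repeated trials. The sketch below is one way to compute it, under assumptions: it uses the standard combinatorial estimator (for a task with c successes in n trials, the fraction of k-run subsets that are all successes is C(c, k) / C(n, k)), and the function name `pass_hat_k` and the trial counts are invented for illustration. They are not τ-bench data.

```python
from math import comb
from typing import Sequence


def pass_hat_k(successes_per_task: Sequence[int], n_trials: int, k: int) -> float:
    """Estimate the chance that k runs of a task ALL succeed, averaged over tasks.

    successes_per_task[i] is how many of the n_trials runs of task i succeeded.
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    subsets = comb(n_trials, k)
    # comb(c, k) is 0 when c < k, so unreliable tasks contribute nothing.
    all_success = sum(comb(c, k) for c in successes_per_task)
    return all_success / (subsets * len(successes_per_task))


# Made-up results: four tasks, eight runs each.
successes = [8, 6, 5, 1]

print(pass_hat_k(successes, n_trials=8, k=1))  # 0.625, the single-run rate
print(pass_hat_k(successes, n_trials=8, k=8))  # 0.25, only one task never failed
```

With k=1 the estimator reduces to ordinary average accuracy. As k grows, a task counts only to the extent that the agent solves it on every attempt, which is the property a production deployment depends on.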

Auto-waiting assumes the agent knows which element it's waiting for. Stable locators assume the planning layer will reach for the same element each run. Container isolation seals off the environment, which is the wrong perimeter when the variance lives inside the model's token sampling. ReliabilityBench, published in January 2026, formalized the point: single-run accuracy fails to capture the reliability characteristics that production deployment actually requires.

The testing world spent a decade building coping infrastructure for non-determinism in the environment surrounding a deterministic script. That work was hard-won and real. Agents now face non-determinism at a depth those tools were never designed to reach. What does coping infrastructure look like when the inconsistency originates inside the script itself?
