Playwright handles proofs-of-concept beautifully. Three weeks into production, teams are rebuilding from scratch. Not because Playwright failed; it did exactly what it was designed to do. The problem is that what it was designed to do and what those teams needed are different categories entirely.
At TinyFish, where we build enterprise web agent infrastructure, this misidentification shows up constantly. Teams conflate browser automation tools with web automation infrastructure because both control browsers, both extract data, both handle navigation. The industry reinforces the confusion by using "browser automation" as a catch-all term, collapsing testing tools and production infrastructure into a single category. That linguistic sloppiness costs enterprises months of engineering time and failed deployments.
The distinction becomes visible when you try to run production operations at scale. By then, you've already committed architecture and engineering resources to the wrong foundation.
Why the Confusion Persists
Browser automation tools were built for testing. Selenium emerged in 2004 for QA teams automating manual test cases. Google built Puppeteer in 2017 to control Chrome for testing. Microsoft's Playwright arrived in 2020 with better parallel execution—still for testing.
The architecture reflects this origin: run a script against your own staging environment, verify behavior, tear down.
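To make that workflow concrete, here is a minimal sketch of the pattern these tools were built around; the URL and selector are placeholders for an environment you control.

```typescript
// Minimal sketch of the testing workflow: launch, verify, tear down.
// The URL and selector are placeholders for your own staging environment.
import { test, expect } from '@playwright/test';

test('checkout page renders the order summary', async ({ page }) => {
  // Run against a controlled environment you own.
  await page.goto('https://staging.example.com/checkout');

  // Verify expected behavior.
  await expect(page.locator('#order-summary')).toBeVisible();

  // The test runner tears the browser context down automatically.
});
```

Everything about this model assumes a cooperative target and a short-lived session, and that assumption is exactly what breaks in production.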
These tools excel at their purpose. The confusion happens because "controlling a browser" sounds like what you need for web agents operating in production across thousands of adversarial third-party sites. Testing your own staging environment and operating at scale across the live web are different problems requiring different architectures. The industry's terminology obscures this until you're already in production—and by then, you're rebuilding.
How to Tell Them Apart
Three characteristics distinguish the categories, visible only when you're running production operations:
Resource architecture. Testing tools consume resources in ways that make scaling difficult. When Structify moved from browser automation tools to infrastructure services, they reported:
"going from gigabytes of RAM being eaten up to basically zero"
That's not an optimization. That's an architectural difference.
Tools are built for sequential test runs. Infrastructure is built for concurrent production operations managing thousands of browser sessions simultaneously.
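One way to picture the difference, as a rough sketch rather than a prescription: with a tool, every session is a full browser process on your own machine; with infrastructure, your process holds only a connection to a browser running elsewhere. The endpoint URL below is hypothetical, standing in for whatever managed browser service you operate or buy.

```typescript
import { chromium } from 'playwright';

// Tool pattern: every session is a full browser process on this machine.
// RAM and CPU scale with concurrency until the box falls over.
async function localSession() {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  // ... agent logic ...
  await browser.close();
}

// Infrastructure pattern: the browser runs elsewhere; this process only
// holds a lightweight connection. The endpoint below is a hypothetical
// URL for a remote, managed browser service.
async function remoteSession() {
  const browser = await chromium.connectOverCDP('wss://browsers.internal.example/session');
  const page = await browser.newPage();
  // ... agent logic ...
  await browser.close();
}
```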
Anti-bot handling. Testing tools don't need sophisticated evasion—they're running against your own staging environment. Production agents face active resistance: behavioral fingerprinting, CAPTCHAs, rate limits, regional variations.
We see this pattern constantly at TinyFish: teams arrive with Playwright implementations that worked perfectly in staging and are now failing in production because third-party sites detect the automation immediately.
Here's what separates practitioners from evaluators: practitioners know that "handling CAPTCHAs" isn't a feature checkbox. It's an ongoing operational challenge requiring session management, detection pattern adaptation, and regional variation handling. Tools treat it as something you might add. Infrastructure treats it as core architecture.
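As a sketch of that operational posture (the SessionProvider interface, region list, and looksBlocked heuristic below are invented for illustration, not a real API): detection gets treated as a routine runtime signal that triggers session rotation and regional fallback, not as an exception you fix once.

```typescript
// Hypothetical interfaces: these names are illustrative, not a real API.
interface BrowserSession {
  fetchPage(url: string): Promise<string>;
  dispose(): Promise<void>;
}

interface SessionProvider {
  // Acquire a managed session pinned to a region, with its own
  // fingerprint, cookies, and network egress.
  acquire(region: string): Promise<BrowserSession>;
}

const REGIONS = ['us-east', 'eu-west', 'ap-south'];

// Treat "we got blocked" as a routine operational event: rotate the
// session, try another region, and give up only after real effort.
async function fetchWithRotation(provider: SessionProvider, url: string): Promise<string> {
  for (const region of REGIONS) {
    const session = await provider.acquire(region);
    try {
      const html = await session.fetchPage(url);
      if (!looksBlocked(html)) return html; // success path
    } finally {
      await session.dispose();
    }
  }
  throw new Error(`All regions blocked for ${url}`);
}

// Placeholder detection heuristic; real systems track evolving patterns.
function looksBlocked(html: string): boolean {
  return /captcha|access denied/i.test(html);
}
```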
Observability for reliability. When a test fails, you rerun it. When a production agent monitoring competitor pricing across 10,000 sites fails at 3am, you need to know what broke, why, and whether your SLA is at risk. Infrastructure provides SLA-driven monitoring, distributed tracing, session replay. Tools provide test results.
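A sketch of what that instrumentation can look like, using the OpenTelemetry API; the tracer name, span attributes, and runPricingCheck placeholder are illustrative, and a real deployment would pair traces with session replay and SLA metrics.

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('web-agent'); // name is illustrative

// Placeholder for the actual agent work against one target site.
declare function runPricingCheck(site: string): Promise<void>;

// Wrap each per-site operation in a span so a 3am failure comes with
// the site, duration, and error attached, not just "the run failed."
async function monitorSite(site: string): Promise<void> {
  await tracer.startActiveSpan('pricing-check', async (span) => {
    span.setAttribute('target.site', site);
    try {
      await runPricingCheck(site);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err; // let the scheduler decide about retries and SLA alerts
    } finally {
      span.end();
    }
  });
}
```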
The Real Cost
Teams choose browser automation tools because they work brilliantly in demos. Then they hit production scale and discover the tool consuming resources it wasn't architected to manage, getting blocked immediately because its behavioral patterns scream "automation," or failing with no observability to debug what happened.
Category mismatch, not tool quality. And it costs enterprises months of engineering time rebuilding from scratch after they've already committed to architectural decisions that assume their foundation can handle production load.
The evaluation framework isn't feature lists or vendor claims. It's operational reality: Are you testing your own systems in controlled environments, or are you operating at scale across the adversarial web where reliability determines business outcomes? That distinction determines which category you need—before you've committed engineering resources to the wrong foundation.
Things to follow up on...
- Playwright's parallel execution advantage: While Playwright supports parallel execution by default, making it superior to Puppeteer and Selenium for concurrent testing, this capability still reflects testing architecture rather than production infrastructure requirements for managing thousands of simultaneous browser sessions.
- Bot detection behavioral patterns: Automated scripts reveal themselves through uniform scrolling patterns, consistent click intervals, and predictable request timing—detection mechanisms that testing tools weren't designed to evade because they operate in controlled environments rather than adversarial production contexts.
- Infrastructure observability requirements: Modern observability extends beyond traditional monitoring by providing continuous actionable insights from system signals, with one organization cutting incidents by 90% and reducing response times from hours to seconds through comprehensive real-time infrastructure visibility.
- Resource consumption at scale: Running parallel browser automation quickly consumes CPU and memory on local or small CI machines, with Docker deployments requiring careful resource limits like MAX_CONCURRENT_SESSIONS=10 and memory caps at 2GB to prevent system overload during concurrent operations (see the sketch after this list).
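
A quick sketch of enforcing that kind of cap in application code, assuming locally launched browsers; the limit mirrors the MAX_CONCURRENT_SESSIONS example above, and the URL list is a placeholder.

```typescript
import { chromium, Browser } from 'playwright';

const MAX_CONCURRENT_SESSIONS = 10; // mirrors the Docker-level cap above

// Minimal worker-pool limiter: run at most `limit` tasks at once so
// parallel browser sessions don't exhaust local CPU and memory.
async function runWithLimit<T>(limit: number, tasks: (() => Promise<T>)[]): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  }

  await Promise.all(Array.from({ length: Math.min(limit, tasks.length) }, worker));
  return results;
}

// Placeholder workload: each task owns one browser for its lifetime.
const urls = ['https://example.com/a', 'https://example.com/b'];
const tasks = urls.map((url) => async () => {
  const browser: Browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    return await page.title();
  } finally {
    await browser.close();
  }
});

runWithLimit(MAX_CONCURRENT_SESSIONS, tasks).then(console.log);
```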

