Testing Tools Are Not Production Infrastructure

Watching teams discover their Playwright setup won't scale past demos—it's like watching clouds gather before rain hits. You know what's coming. The industry lumps "browser automation" into one bucket when it's really two architectures solving completely different problems. What's ahead: way more companies hitting this wall as agents move from proof-of-concept to actual production. The gap isn't subtle when you're suddenly managing thousands of browser sessions and everything that seemed simple in staging breaks against real sites.
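A rough sketch of the two architectures, assuming a Playwright-based stack; the pool endpoint and environment variable below are placeholders, not a real service. The test-tool pattern launches a throwaway local browser per run, while the production pattern attaches to browsers that dedicated infrastructure keeps alive, pooled, and recoverable.

```typescript
import { chromium, type Browser } from 'playwright';

// Test-tool pattern: spin up a fresh local browser for each run.
// Perfect for a CI suite; painful when you need thousands of concurrent,
// long-lived agent sessions.
async function launchLocalBrowser(): Promise<Browser> {
  return chromium.launch({ headless: true });
}

// Production pattern: connect over CDP to an externally managed browser
// fleet. Pooling, proxying, session reuse, and crash recovery all live
// behind that endpoint (a hypothetical one here), which is exactly the
// infrastructure a test tool never had to provide.
async function connectToBrowserPool(): Promise<Browser> {
  const endpoint = process.env.BROWSER_POOL_WS ?? 'ws://browser-pool.internal:9222';
  return chromium.connectOverCDP(endpoint);
}
```

Everything that makes the second path work is infrastructure, and none of it comes bundled with the testing tool.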

When Developer Tools Made Selectors Disposable

Last week someone asked why their web automation keeps breaking even when sites look exactly the same. Got me thinking about the frontend tooling shift around 2015 that nobody building automation saw coming. Component frameworks solved real developer problems by making CSS selectors disposable, regenerated on every build. Brilliant for internal teams, an absolute nightmare for systematic monitoring.

The forecast: this gap between "site looks stable" and "selectors regenerate constantly" keeps widening as more teams adopt modern frontend stacks. Not about blaming developers; they optimized for the right use case. Just need to understand why reliable web automation now requires infrastructure depth most enterprises completely underestimate.
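A minimal sketch of the difference, using Playwright's test runner; the URL, labels, and class name are made up for illustration. The commented-out selector is the kind a CSS-in-JS build regenerates; the semantic locators survive a rebuild as long as the UI itself doesn't change.

```typescript
import { test, expect } from '@playwright/test';

test('submit the signup form', async ({ page }) => {
  await page.goto('https://example.com/signup');

  // Brittle: generated class names like this change on the next build,
  // even though the rendered page looks identical to a human.
  // await page.click('.Button-sc-1a2b3c.kXqPfD');

  // More durable: target semantics the build pipeline doesn't regenerate.
  await page.getByLabel('Email').fill('user@example.com');
  await page.getByRole('button', { name: 'Sign up' }).click();

  await expect(page.getByText('Check your inbox')).toBeVisible();
});
```

Semantic locators don't make selectors permanent, but they at least break when the product changes rather than every time the build runs.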

When Developer Tools Made Selectors Disposable

Last week someone asked why their web automation keeps breaking even when sites look exactly the same. Got me thinking about this frontend tooling shift around 2015 that nobody building automation saw coming. Component frameworks solved real developer problems by making CSS selectors disposable—regenerating them on every build. Brilliant for internal teams, absolute nightmare for systematic monitoring.
The forecast: this gap between "site looks stable" and "selectors regenerate constantly" keeps widening as more teams adopt modern frontend stacks. Not about blaming developers, they optimized for the right use case. Just need to understand why reliable web automation now requires infrastructure depth most enterprises completely underestimate.

An Interview With Rate Limiting (Who Insists You've Been Misunderstanding Them)

Pattern Recognition from the Field

I keep seeing the same thing: agents that shine in demos fall apart when you run them repeatedly. Superface's benchmarks tell the story. Simple CRM tasks like creating Salesforce leads or updating HubSpot pipelines fail 75% of the time when agents chain them together. Single actions might work half the time. String six together and you're looking at 10-20% success rates.
Carnegie Mellon found even the best models complete only 30% of office tasks autonomously. The arithmetic is brutal. A 20% error rate per action means a five-step workflow has roughly a one-in-three chance of working end-to-end (0.8^5 ≈ 0.33).
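The compounding is easy to sanity-check; this naive model assumes independent steps, which real workflows only approximate.

```typescript
// Probability that n chained steps all succeed, given per-step success rate p.
const chainedSuccess = (p: number, n: number): number => Math.pow(p, n);

chainedSuccess(0.8, 5);  // ~0.33: the five-step example above
chainedSuccess(0.8, 10); // ~0.11: double the steps, a third of the reliability
```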
Companies are building agents without evaluation infrastructure. They're treating probabilistic systems like deterministic code. What actually works: build your eval framework before your agent. Design for graceful failure from day one. Use specialist agents handling 10-20 tools maximum instead of one super-agent trying to do everything. The demo-to-production gap isn't something to solve. It's a constraint to design around.
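One way to read "design for graceful failure from day one", sketched as a TypeScript helper; the Result shape and withRetry wrapper are illustrative, not from any particular agent framework. The point is that every tool call yields a structured outcome the orchestrator can retry, escalate, or route to a human, instead of letting one flaky action sink the whole chain.

```typescript
// Illustrative types: every tool call returns a structured outcome instead
// of throwing, so the agent loop can decide what to do with a failure.
type Result<T> =
  | { ok: true; value: T }
  | { ok: false; error: string; retriable: boolean };

async function withRetry<T>(
  action: () => Promise<T>,
  attempts = 3,
): Promise<Result<T>> {
  let lastError = 'unknown error';
  for (let i = 0; i < attempts; i++) {
    try {
      return { ok: true, value: await action() };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  // Out of retries: report a structured failure rather than crashing the
  // chain, so the caller can escalate or hand off to a narrower specialist.
  return { ok: false, error: lastError, retriable: false };
}
```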
Questions Worth Asking

