A browser automation workflow runs for the hundredth time. It logs in, searches, extracts results. On the first run, Stagehand v3 called a language model to figure out each interaction. On the hundredth run, it doesn't. The framework cached what it learned and replays it deterministically, like a Playwright script that wrote itself. AI only re-engages when something breaks. A layout shifts. A selector stops resolving. The page isn't what it used to be.
Browser Use, another open-source framework solving the same problem, does something fundamentally different. Every run, every step, the page state goes to a language model, which reasons about what to do, executes, observes the result, and reasons again. The hundredth run costs roughly what the first one did.
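That loop can be caricatured in a few lines. This is a hypothetical sketch, not Browser Use's actual API: `ScriptedLLM` stands in for a real model client, and page states are just strings.

```python
class ScriptedLLM:
    """Stand-in for a real model client: returns the next action for a
    known page state and counts inference calls."""
    PLAYBOOK = {
        "login_page": "type_credentials",
        "search_page": "submit_query",
        "results_page": "done",
    }
    def __init__(self):
        self.calls = 0
    def decide(self, page_state):
        self.calls += 1              # every step is an inference call
        return self.PLAYBOOK[page_state]

def agent_loop(llm, page_states):
    """Observe -> reason -> act -> observe again, every step, every run."""
    actions = []
    for state in page_states:
        action = llm.decide(state)   # the model reasons from raw page state
        if action == "done":
            break
        actions.append(action)
    return actions

llm = ScriptedLLM()
for _ in range(2):                   # same workflow, two runs
    agent_loop(llm, ["login_page", "search_page", "results_page"])
print(llm.calls)                     # 6: three calls per run, no amortization
```

The flat cost curve falls directly out of the structure: the model is consulted once per step regardless of how many times the workflow has already succeeded.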
Two frameworks. Same task. One gets cheaper and faster with repetition. The other stays flat. The difference encodes a claim about the world.
Stagehand's caching system is the artifact worth examining closely. When an AI-driven action succeeds, the system records the path and replays it without model involvement next time. Over successive runs, workflows converge toward the performance profile of hand-written automation. Practitioners report Stagehand at 5–10 seconds per run and Browser Use at 15–30, against a raw Playwright baseline of under 2. As Stagehand's cache fills, its developers say performance converges toward that Playwright baseline — workflows that once required inference gradually stop needing it. But when the site redesigns its checkout flow, the Playwright script breaks. Stagehand re-engages AI, finds a new path, caches it. Browser Use barely notices, because it was already reasoning from scratch.
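The learn-once, replay-after pattern is easy to sketch. This is a toy model of the idea, not Stagehand's implementation: a "page" is a dict of selectors, and `ai_resolver` stands in for an expensive model call.

```python
class SelfHealingAction:
    """AI resolves the selector on the first run (or after a failure);
    deterministic replay handles every run in between."""

    def __init__(self, ai_resolver):
        self.ai_resolver = ai_resolver   # expensive: a model round-trip
        self.cached_selector = None
        self.ai_calls = 0

    def run(self, page):
        # Fast path: replay the cached selector, no model involvement.
        if self.cached_selector in page:
            return page[self.cached_selector]
        # Cache miss or broken selector: re-engage AI, re-cache the path.
        self.ai_calls += 1
        self.cached_selector = self.ai_resolver(page)
        return page[self.cached_selector]

# Toy resolver: "find the login control" by inspecting the page.
resolver = lambda page: next(k for k in page if "login" in k)

action = SelfHealingAction(resolver)
page_v1 = {"#login-btn": "clicked v1"}
for _ in range(100):
    action.run(page_v1)
print(action.ai_calls)   # 1: runs 2 through 100 replay the cache

# Site redesign: the old selector is gone, AI re-engages once.
page_v2 = {"#login-button-new": "clicked v2"}
action.run(page_v2)
print(action.ai_calls)   # 2
```

The hundredth run costs one dict lookup; the redesign costs exactly one more model call. That is the cost curve bending downward.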
The tension is real and unresolved. Continuous reasoning buys resilience to novelty. Selective reasoning buys a cost curve that bends downward. Each framework is optimizing for a different kind of uncertainty.
And yet they agree on more than you'd expect. Both independently migrated from Playwright to raw CDP, the Chrome DevTools Protocol. Playwright was built for testing. Its abstraction layer adds latency and hides page state that models need to reason well. Speaking the browser's native protocol gives both frameworks richer context and faster execution. They converged on the foundation and then diverged on everything above it.
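CDP itself is just JSON frames over the browser's websocket debugging endpoint. `Page.navigate` and `Runtime.evaluate` are real CDP methods; everything else here is a minimal sketch of the encoding, with no live browser involved:

```python
import json
from itertools import count

_ids = count(1)   # CDP correlates responses to requests by id

def cdp_command(method, params=None):
    """Encode a Chrome DevTools Protocol command as the JSON frame a
    client sends over the browser's websocket endpoint."""
    return json.dumps({"id": next(_ids), "method": method, "params": params or {}})

# Two of the low-level commands a call like Playwright's page.goto()
# would issue on your behalf, several abstraction layers down.
print(cdp_command("Page.navigate", {"url": "https://example.com"}))
print(cdp_command("Runtime.evaluate", {"expression": "document.title"}))
```

Speaking this protocol directly is what lets both frameworks see raw page state instead of whatever the testing abstraction chooses to surface.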
That divergence shows up most clearly in how each framework divides labor between human and machine. Stagehand exposes AI primitives that developers weave into otherwise deterministic scripts. The developer writes the automation logic and designates specific moments where intelligence belongs. Browser Use hands the whole workflow to an agent loop. The developer writes goals. Stagehand treats AI as a tool the programmer reaches for. Browser Use treats it as the programmer.
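The two call shapes look roughly like this. These are hypothetical sketches of the style, not either library's real API; `Page`, `AI`, and `Agent` are stubs.

```python
class Page:
    # Stub of a deterministic browser driver.
    def __init__(self): self.log = []
    def goto(self, url): self.log.append(("goto", url))
    def click(self, sel): self.log.append(("click", sel))

class AI:
    # Stub: in reality, this call is an LLM round-trip.
    def act(self, page, instruction): page.log.append(("ai", instruction))

# Style 1: AI as a tool (Stagehand-like). The developer owns the flow
# and designates the one step where intelligence belongs.
def checkout(page, ai):
    page.goto("/cart")
    page.click("#checkout")
    ai.act(page, "fill the shipping form")

# Style 2: AI as the programmer (Browser Use-like). The developer
# states a goal; an agent loop owns every step.
class Agent:
    def __init__(self): self.goal = None
    def run(self, goal): self.goal = goal   # stub for a full reason/act loop

page, ai, agent = Page(), AI(), Agent()
checkout(page, ai)
agent.run("buy the item in my cart, ship to my saved address")
print(page.log[-1])   # ('ai', 'fill the shipping form')
```

In style 1 the control flow is code and the AI call is one line among many; in style 2 the control flow lives inside the agent and the code is one line of intent.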
Stagehand's design choice is, at bottom, an economic observation about intelligence. Most browser interactions, most of the time, are acts of remembering. A login form that worked yesterday will almost certainly work today. AI is slow, expensive, and nondeterministic. So use it to learn, then stop using it until the world changes.
Browser Use makes the opposite bet: the world changes enough, unpredictably enough, that persistent reasoning is worth its cost. For exploratory tasks on unfamiliar sites, this is probably right. For the thousandth run of a known workflow, you're paying for certainty you already have.
This tension will surface everywhere AI meets repetitive work, from extracting prices across ten thousand product pages to routing insurance claims through legacy portals. Adding intelligence to any step is now trivial. Knowing which steps deserve it, and which are just overhead dressed up as diligence, is where the design work lives.
Things to follow up on...
- Both frameworks ditched Playwright: Browser Use published a detailed account of why they left Playwright for raw CDP, describing how the abstraction layer obscured browser details that AI agents need to reason well.
- The hybrid pattern in production: A practitioner comparison across all three approaches found that many production systems use Playwright for the 80% of predictable steps and reach for AI frameworks only where selectors are likely to break.
- Training agents before deploying them: Deeptune just raised $43M to build reinforcement learning environments where agents practice workflows in simulated workplace software before touching production, a different answer to the reliability question these frameworks are solving architecturally.
- NVIDIA's trust runtime for agents: At GTC this week, NVIDIA launched OpenShell as part of its Agent Toolkit, an open-source runtime that enforces policy-based security and privacy guardrails for enterprise agents, addressing the governance layer that sits above any individual framework's design choices.