Most browser agent frameworks start by taking a screenshot. They feed the image to a model and ask it to figure out where to click. Magnus Müller and Gregor Žunič, both out of ETH Zurich, went a different direction. Their open-source framework, Browser Use, converts web pages into structured text that language models can process directly. The prototype took four days. It hit Hacker News, collected over 50,000 GitHub stars, became a YC W25 company, and closed a $17M seed led by Felicis in March 2025. All worth noting, and all prelude to what came next.
Parsing a page's actual structure means working closer to what the page is. A screenshot flattens everything into pixels. Text extraction exposes the elements, the states, the things that shift between loads. As Felicis noted in their investment thesis, this means agents can "deterministically re-run workflows" without requiring expensive per-instance visual inference.
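The core idea is simple to sketch: walk the page's markup and emit the interactive elements as indexed, structured text a model can read and act on. This is a minimal illustration using Python's standard-library parser, not Browser Use's actual representation; the element names and attribute choices here are assumptions for demonstration.

```python
from html.parser import HTMLParser

# Tags an agent can act on; a real extractor also considers ARIA roles,
# visibility, and event handlers.
INTERACTIVE = {"a", "button", "input", "select", "textarea"}

class ElementExtractor(HTMLParser):
    """Collect interactive elements as indexed, structured text."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attr_pairs):
        if tag not in INTERACTIVE:
            return
        attrs = dict(attr_pairs)
        desc = f"[{len(self.elements)}] <{tag}"
        # Keep only the attributes that identify or describe the element.
        for key in ("id", "name", "type", "href", "value"):
            if key in attrs:
                desc += f' {key}="{attrs[key]}"'
        self.elements.append(desc + ">")

html = '<form><input type="email" name="user"><button id="go">Log in</button></form>'
parser = ElementExtractor()
parser.feed(html)
print("\n".join(parser.elements))
```

The indexed output (`[0] <input name="user" type="email">`, `[1] <button id="go">`) is what makes determinism possible: the model refers to element `[1]`, and the same parse of the same page yields the same `[1]` every time, with no per-run visual inference.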
That re-run capability has consequences well beyond efficiency. Run the same workflow against the same site ten times, and you start seeing the web as it actually behaves. A popup loads 100 milliseconds late and throws off an entire agent trajectory. An A/B test swaps a layout between runs. A site quietly restructures its authentication. The page is never quite the same page twice. A vision-based agent encountering these shifts might just fail differently each time, with no clear signal about what changed. A text-based agent parsing structure can surface where the page diverged from expectation. Variance becomes legible. And legible variance is useful.
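Once the page is text, "where did the page diverge" becomes an ordinary diff. A sketch, assuming each run's structure has been captured as a list of element descriptors (the example page contents are hypothetical):

```python
import difflib

def diff_structure(expected, observed):
    """Return the structural lines that changed between two runs."""
    changes = []
    for line in difflib.unified_diff(expected, observed, lineterm="", n=0):
        # Keep added/removed element lines, skip the diff headers.
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
            changes.append(line)
    return changes

run_1 = ['<button id="checkout">', '<input name="qty">']
run_2 = ['<div class="cookie-banner">', '<button id="checkout">', '<input name="qty">']
print(diff_structure(run_1, run_2))  # ['+<div class="cookie-banner">']
```

A vision-based agent hitting that late-loading banner just fails; a structural diff names the element that appeared. That is what "variance becomes legible" means in practice.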
The same instinct led Browser Use toward benchmarking. In January 2026, the team published an evaluation comparing frontier models on real web tasks sourced from millions of LLM-labeled user sessions:
"Synthetic environments completely fail to capture the bizarre reality, complexity, and ugliness of how the actual web is built."
Real sites delivered chaos that synthetic environments had quietly edited out. Pages behaving differently depending on load timing, geography, session state. Tasks that succeed on one run and fail on the next for reasons that have nothing to do with the agent. The team ran each evaluation multiple times and reported standard error bars, noting that many existing agent benchmarks do neither. "That lack of statistical rigor is alarming," they wrote in their companion post.
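The statistical fix is not exotic. Repeat each evaluation, then report the mean success rate with its standard error, so readers can tell a real model gap from run-to-run noise. A minimal sketch with hypothetical numbers (not figures from the Browser Use benchmark):

```python
import statistics

def success_rate_with_sem(runs):
    """Mean success rate across repeated runs, with standard error.

    Each entry is one full evaluation's success rate in [0, 1].
    A single run would hide the variance that repeats expose.
    """
    mean = statistics.mean(runs)
    # Standard error of the mean: sample stdev / sqrt(n)
    sem = statistics.stdev(runs) / len(runs) ** 0.5
    return mean, sem

# Hypothetical: the same agent, the same tasks, five repeated runs.
runs = [0.62, 0.58, 0.65, 0.55, 0.60]
mean, sem = success_rate_with_sem(runs)
print(f"{mean:.3f} ± {sem:.3f}")  # 0.600 ± 0.017
```

With error bars like these, a two-point difference between models on a single run stops looking like a result and starts looking like what it often is: the web being the web.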
Their 5,000-member Discord community had been arriving at the same conclusion from the production side. A site restructures its checkout flow on a Tuesday, and an automation that ran cleanly for weeks starts returning confident, wrong results. The agent doesn't error out. It completes the workflow against a page that no longer matches the one it learned. Bot detection, rate limits, and authentication friction layer on top. These are core challenges the team identified through community conversations. Synthetic test environments include none of them. They show up immediately in real deployments.
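The defense against that failure mode is to stop trusting completion and start checking postconditions: did the page the agent left behind actually look like success? A sketch of the pattern, with simple string checks standing in for real assertions (order ID present, error banner absent); the function and example page text are hypothetical:

```python
def verify_outcome(page_text, required, forbidden=()):
    """Check an agent's claimed success against page-level postconditions.

    Completion alone is not evidence: the agent may have run its steps
    against a page that no longer means what it used to.
    """
    missing = [s for s in required if s not in page_text]
    unexpected = [s for s in forbidden if s in page_text]
    ok = not missing and not unexpected
    return ok, {"missing": missing, "unexpected": unexpected}

ok, detail = verify_outcome(
    "Thanks! Order #A1934 confirmed.",
    required=["Order #", "confirmed"],
    forbidden=["error", "out of stock"],
)
print(ok)  # True
```

A workflow that completes but fails these checks is exactly the Tuesday-checkout case: the steps ran, the outcome didn't.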
Follow the thread to its end. A technical choice about how to read web pages made it possible to re-run workflows reliably. Reliable re-runs made variance measurable. Measurable variance revealed that existing benchmarks weren't capturing production conditions. And a large community running real tasks on real sites confirmed the gap: the web's non-determinism is the operating condition. Browser Use's experience points toward something specific about the infrastructure web agents actually need. Knowing, precisely, what the web did back.
Things to follow up on...
- Silent failure, documented: CNBC reported two enterprise cases where AI systems completed workflows successfully but produced wrong outcomes, including a beverage manufacturer that overproduced hundreds of thousands of cans because its system misread holiday packaging as an error signal.
- Monitoring maturity gap: Cleanlab's 2025 production survey found that only 5% of AI agents in production have mature monitoring, while CB Insights now ranks observability and evaluation as the most dynamic generative AI market by deal activity.
- Browser as control plane: A practitioner overview from Browserless describes the broader shift underway: instead of writing scripted sequences, teams are starting to rely on agents that interpret goals and carry out multi-step tasks, turning the browser into an automation control plane rather than just a rendering layer.
- The EU deadline looms: By August 2, 2026, the EU AI Act requires providers of high-risk AI systems to have continuous monitoring programs that track performance in real-world conditions, a mandate that doesn't distinguish between a system that crashes and one that silently degrades.

