Magnus Müller had been building bots and web scrapers since he learned to code. Gregor Žunič came from physics, then data science. They met at a hacker house at ETH Zurich in 2024, and together built the first version of Browser Use in roughly five days. Müller's own assessment of what they posted to Hacker News: "really shitty." The project now has 81,000+ GitHub stars, a $17M seed round led by Felicis, and a Y Combinator badge. What the traction forced them to learn about the surface they're building on is where the story gets specific.
Their original question was simple enough:
"How hard could it be to build the interface between LLMs and the web?"
The answer is written into every architectural choice they've made since.
What a page looks like to a model
A web page, to a human, is visual. To a language model, it's a wall of noise. Raw HTML carries scripts, styling, hidden elements, layout scaffolding. Feed it directly to an LLM and you exhaust the context window before the model has located a single button.
Browser Use's foundational choice was to skip the visual layer. They parse the DOM, strip it to interactive elements, and hand the LLM a numbered list: `[1]<button> Submit` `[2]<input placeholder="Enter name">`. The model reads the list, picks a number, acts. Skipping screenshots also sidesteps a latency cost: each one adds roughly 0.8 seconds, and over dozens of steps per task, that compounds. But the latency is the simpler half of the problem. The web's visual surface is designed for humans who can glance, scroll, and infer. Models can't glance. They need the page translated into something they can reason over, and that translation work is where the actual engineering lives.
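The shape of that translation is easy to sketch, even if the production version is far more involved. A minimal illustration, assuming an HTML string and BeautifulSoup for parsing; the tag list and output format here are simplifications for the example, not Browser Use's actual internals:

```python
# Minimal sketch of the DOM-to-list translation described above.
# The tag set and output format are illustrative simplifications,
# not Browser Use's internals.
from bs4 import BeautifulSoup  # pip install beautifulsoup4

INTERACTIVE_TAGS = ["a", "button", "input", "select", "textarea"]

def serialize_interactive(html: str):
    """Return a numbered listing of interactive elements plus an
    index -> element map, so the model's chosen number can be
    resolved back to a real node to act on."""
    soup = BeautifulSoup(html, "html.parser")
    lines, index = [], {}
    for i, node in enumerate(soup.find_all(INTERACTIVE_TAGS), start=1):
        label = (node.get_text(strip=True)
                 or node.get("placeholder", "")
                 or node.get("aria-label", ""))
        lines.append(f"[{i}]<{node.name}> {label}")
        index[i] = node
    return "\n".join(lines), index

listing, index = serialize_interactive(
    '<div><button>Submit</button><input placeholder="Enter name"></div>'
)
print(listing)
# [1]<button> Submit
# [2]<input> Enter name
```

The index map is the other half of the trick: the model only ever sees integers, and the runtime resolves each one back to a live element before acting.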
The web's incidental hostility
The modern web's resistance to machine reading is largely incidental. Shadow DOM, iframes, async loading: these were built for encapsulation, performance, and user experience. They just happen to create a surface that automated systems struggle to parse.
Shadow DOM boundaries are invisible walls that XPath cannot cross, because the XPath specification predates Shadow DOM entirely. Nest an iframe inside a shadow root and you get compounding incompatibilities between the web's own architecture and the tools available to navigate it. Machine comprehension was never a design consideration. The web is indifferent to it.
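The practical upshot is that no single query can see the whole page; the traversal has to cross each boundary by hand. A toy sketch of that shape, with a node model invented purely for this illustration (real implementations walk the browser's live DOM, typically over something like the Chrome DevTools Protocol):

```python
# Toy node model illustrating why shadow roots and iframes force
# explicit recursion: a flat query (like one XPath expression)
# only sees the top document, while the walk below crosses each
# boundary by hand. Invented for illustration, not a real DOM API.
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    children: list["Node"] = field(default_factory=list)
    shadow_root: "Node | None" = None   # boundary XPath cannot cross
    content_doc: "Node | None" = None   # iframe's separate document

def walk(node: Node):
    yield node
    for child in node.children:
        yield from walk(child)
    if node.shadow_root is not None:    # descend into the shadow DOM
        yield from walk(node.shadow_root)
    if node.content_doc is not None:    # descend into the iframe document
        yield from walk(node.content_doc)

# A button nested inside a shadow root inside an iframe:
page = Node("html", children=[
    Node("iframe", content_doc=Node("html", children=[
        Node("widget", shadow_root=Node("shadow", children=[Node("button")])),
    ])),
])
print([n.tag for n in walk(page)])
# ['html', 'iframe', 'html', 'widget', 'shadow', 'button']
```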
Then there's the state problem. A coding agent works in an environment shaped by its own prior actions. A browser agent faces a world that changes independently. Content loads asynchronously. Modals appear. Elements shift position. The page the model reasoned about two seconds ago may no longer exist.
Browser Use marks newly detected elements with an asterisk to signal change. A small accommodation to a large reality.
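The mechanics are easy to sketch, if not to get right: diff the serialized element list between steps and flag what wasn't there before. A hypothetical version; a real implementation needs a stabler notion of element identity than string equality:

```python
# Hypothetical sketch of marking newly appeared elements between
# agent steps, as described above. Real systems need a stronger
# notion of element identity than the rendered string used here.
def mark_new(previous: list[str], current: list[str]) -> list[str]:
    seen = set(previous)
    return [line if line in seen else f"*{line}" for line in current]

step_1 = ["[1]<button> Submit"]
step_2 = ["[1]<button> Submit", "[2]<div> Cookie consent modal"]
print("\n".join(mark_new(step_1, step_2)))
# [1]<button> Submit
# *[2]<div> Cookie consent modal
```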
When general-purpose models break
General-purpose LLMs produce verbose output, and output tokens cost roughly 215x more processing time per unit than input tokens. So Müller and Žunič built custom models with action vocabularies compressed to 10–15 tokens per step. That compression only made sense after they'd watched general-purpose models waste tokens on a problem that punishes verbosity. Their BU 2.0 release reported a 12-point accuracy gain over the previous version, reaching 83.3% on their internal benchmark. Each optimization arrived because the previous assumption broke against the actual environment.
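What that compression looks like in spirit: instead of letting the model narrate a JSON plan, its output is constrained to a terse action grammar over the numbered element list. The grammar below is invented for illustration, and actual token counts depend on the tokenizer:

```python
# Illustrative contrast between a verbose general-purpose action
# and a compressed action grammar. The compact grammar is invented
# for this example; real token counts depend on the tokenizer.
import json
import re

verbose = json.dumps({
    "thought": "The form needs a name before I can submit it.",
    "action": {"type": "type_text", "element_index": 2, "text": "Ada"},
})
compact = "type(2,'Ada')"  # one call against the numbered element list

ACTION_RE = re.compile(r"(?P<op>\w+)\((?P<index>\d+)(?:,'(?P<arg>[^']*)')?\)")

def parse(action: str):
    """Decode a compact action back into (operation, element, argument)."""
    m = ACTION_RE.fullmatch(action)
    if m is None:
        raise ValueError(f"unparseable action: {action!r}")
    return m["op"], int(m["index"]), m["arg"]

print(f"verbose: {len(verbose)} chars, compact: {len(compact)} chars")
print(parse(compact))       # ('type', 2, 'Ada')
print(parse("click(7)"))    # ('click', 7, None)
```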
That 83.3% is worth sitting with. On a curated benchmark, after purpose-built models and a compressed action space, roughly one in six tasks still fails. Closing that last gap is where demo-grade automation becomes infrastructure you can depend on. Browser Use now occupies exactly that territory.
Eighty-one thousand stars tell you how many people need this negotiation layer between models and the live web. The engineering record tells you why it keeps being hard.
Things to follow up on...
- WebMCP as structural fix: Google's WebMCP proposal would let websites expose structured actions directly to agents, potentially replacing dozens of fragile browser interactions with a single typed function call.
- Reliability degrades super-linearly: A recent empirical study across 23,392 episodes confirmed that agent reliability falls faster than task complexity rises, and that this degradation is invisible to standard benchmarks.
- Operational complexity, not models: Datadog's State of AI Engineering 2026 report found that rate limit errors alone accounted for 8.4 million failed LLM spans in a single month, suggesting infrastructure ceilings matter more than model capability.
- Prompt injection in production: Google and Forcepoint simultaneously published research documenting a 32% increase in malicious prompt injection payloads embedded in the public web between November 2025 and February 2026.

