Where Intelligence Lives in Each Step

The most consequential design choice in browser automation is deciding where each step needs model intelligence and where determinism suffices.

By Rina Takahashi, May 21, 2026

A login form has a username field with a label, a password field with a type attribute, and a button that says "Sign In." A Playwright selector handles this in milliseconds. No model call, no tokens, no ambiguity.
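
As a rough illustration of that login step, here is a minimal sketch using Playwright's Python locator methods. The function name `sign_in` and the label texts "Username" and "Password" are assumptions about the form; only the "Sign In" button text comes from the example above.

```python
from typing import Any


def sign_in(page: Any, username: str, password: str) -> None:
    """Fill and submit a login form using deterministic locators only.

    `page` is expected to behave like a Playwright sync-API Page.
    The label texts "Username" and "Password" are assumed, not known.
    """
    # Locate fields by their visible labels; no model call is involved.
    page.get_by_label("Username").fill(username)
    page.get_by_label("Password").fill(password)
    # Locate the button by ARIA role and accessible name.
    page.get_by_role("button", name="Sign In").click()
```

With Playwright installed, `page` would come from something like `browser.new_page()`; the function itself never touches a model.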

Two steps later, the same workflow hits a CAPTCHA. The DOM contains a canvas element. Inside the canvas: a bitmap. The accessibility tree has nothing useful to say about it. The only way through is to send a screenshot to a vision model and ask what it sees.

Same workflow. Completely different amounts of intelligence required per step. That gap drives every other decision downstream.

Three layers, one workflow

Deterministic selectors sit at the bottom. CSS selectors, ARIA role queries, Playwright locators. Zero model calls, sub-millisecond execution, completely predictable. When the DOM shifts (dynamic class names, CSS-in-JS hashes, framework redeployments), they break silently.

The accessibility tree sits in the middle, and most teams underestimate it. This is the data structure screen readers use: roles, labels, states, focusable elements. Chrome builds it by walking the DOM and filtering out anything without semantic or interactive value. What's left is a compact representation of what a page actually does: its interactive surface, stripped of decoration. A page that costs 3,000–5,000 tokens as a screenshot might be 200–500 tokens as an accessibility tree snapshot. Even at the conservative end of those ranges (3,000 versus 500 tokens per page), a 10-step task costs roughly 6x more with screenshots.
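
The arithmetic behind that comparison can be sketched as follows. It assumes one page observation per step and ignores prompt and history overhead, neither of which the figures above specify; the function names are illustrative.

```python
# Per-page token ranges quoted in the text: (low, high).
SCREENSHOT_TOKENS = (3_000, 5_000)
TREE_TOKENS = (200, 500)
STEPS = 10  # the 10-step task from the text


def task_tokens(per_page: tuple[int, int], steps: int) -> tuple[int, int]:
    """Total (low, high) tokens, assuming one page observation per step."""
    low, high = per_page
    return low * steps, high * steps


def ratio_bounds(expensive: tuple[int, int], cheap: tuple[int, int]) -> tuple[float, float]:
    """Smallest and largest possible cost ratio between two token ranges."""
    smallest = expensive[0] / cheap[1]  # cheapest screenshot vs. priciest tree
    largest = expensive[1] / cheap[0]   # priciest screenshot vs. cheapest tree
    return smallest, largest


print(task_tokens(SCREENSHOT_TOKENS, STEPS))         # (30000, 50000)
print(task_tokens(TREE_TOKENS, STEPS))               # (2000, 5000)
print(ratio_bounds(SCREENSHOT_TOKENS, TREE_TOKENS))  # (6.0, 25.0)
```

Under these assumptions the 6x figure is the lower bound; the gap widens toward 25x when a large screenshot is compared against a small tree snapshot.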

The tree has specific, knowable boundaries: canvas elements are invisible to it, custom components without ARIA labels vanish or get wrong roles, and closed shadow DOM returns null. You can map all of these conditions before anything runs.
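
One way to map those conditions ahead of time is a small pre-flight check over a page inventory. This is a sketch only: `ElementInfo` and `tree_blind_spot` are hypothetical names, and the fields are a simplified stand-in for whatever a real crawler would collect.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ElementInfo:
    """Hypothetical, simplified description of one element on a page."""
    tag: str
    role: Optional[str] = None             # ARIA or implicit role, if any
    accessible_name: Optional[str] = None  # label exposed to assistive tech
    is_custom_component: bool = False
    shadow_mode: Optional[str] = None      # "open", "closed", or None


def tree_blind_spot(el: ElementInfo) -> Optional[str]:
    """Return why the accessibility tree is unreliable for `el`, else None."""
    if el.tag == "canvas":
        return "canvas content is a bitmap the tree cannot describe"
    if el.shadow_mode == "closed":
        return "closed shadow DOM is not reachable from outside"
    if el.is_custom_component and not (el.role and el.accessible_name):
        return "custom component is missing an ARIA role or label"
    return None


inventory = [
    ElementInfo(tag="button", role="button", accessible_name="Sign In"),
    ElementInfo(tag="canvas"),
    ElementInfo(tag="x-date-picker", is_custom_component=True),
]
for el in inventory:
    reason = tree_blind_spot(el)
    print(el.tag, "->", reason or "tree is sufficient")
```

Steps that touch a flagged element can be routed to vision up front instead of failing at run time.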

Screenshots and vision sit at the top. Expensive, slow, flexible. Vision handles canvas-rendered dashboards, image-heavy interfaces, visual verification, and the anti-bot challenges that evaluate hundreds of behavioral signals beyond anything DOM inspection can reach.

Layer	Token cost	Latency	Main limitation
Deterministic selectors	Zero	Sub-millisecond	Break silently when the DOM shifts
Accessibility tree	~200–500 per page	Low	Blind to canvas, missing ARIA, closed shadow DOM
Screenshots + vision	~3,000–5,000 per page	Highest	Expensive and slow, but the most flexible fallback

What the reliability numbers show

A practitioner comparison across structured tasks puts DOM-driven approaches 12–17 percentage points ahead of vision-driven ones on forms, navigation, and data extraction. On visually complex benchmarks like VisualWebArena, pure-DOM approaches underperform. DOM access is more reliable for the roughly 80% of tasks where the DOM is well-structured. Vision is required for the rest.

So the practical question is how gracefully a system moves between them within a single workflow. A well-designed system detects when a selector fails mid-run and escalates. First it tries the accessibility tree as a structured fallback. If the tree has nothing useful (canvas, missing ARIA labels), it takes a screenshot and hands it to a vision model. That cascade can happen per-step, invisibly. And the per-step choice matters because the right layer can change during a run. A form that was pure DOM yesterday might have a canvas CAPTCHA injected tomorrow.
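
The per-step cascade can be sketched as an ordered list of layers, cheapest first. Everything here is illustrative: `run_step`, the `Layer` type, and the three stub layers are hypothetical stand-ins for real selector, accessibility-tree, and vision calls.

```python
from typing import Callable, Optional, Sequence, Tuple

# A layer attempts one step and returns a result, or None if it cannot handle it.
Layer = Callable[[str], Optional[str]]


def run_step(step: str, layers: Sequence[Tuple[str, Layer]]) -> Tuple[str, str]:
    """Try each layer in order, cheapest first; return (layer_name, result)."""
    for name, attempt in layers:
        try:
            result = attempt(step)
        except Exception:
            result = None  # treat a raised error like "cannot handle this step"
        if result is not None:
            return name, result
    raise RuntimeError(f"no layer could handle step: {step!r}")


# Stub layers standing in for real implementations.
def selector_layer(step: str) -> Optional[str]:
    return "clicked #submit" if step == "submit form" else None


def tree_layer(step: str) -> Optional[str]:
    return "clicked button 'Next'" if step == "go to next page" else None


def vision_layer(step: str) -> Optional[str]:
    return f"vision model handled: {step}"


LAYERS = [
    ("selector", selector_layer),
    ("accessibility_tree", tree_layer),
    ("vision", vision_layer),
]

for step in ["submit form", "go to next page", "solve captcha"]:
    print(step, "->", run_step(step, LAYERS))
```

This sketch only escalates on failures that surface as an error or an empty result. A selector that silently matches the wrong element would need a verification check after the action before the cascade could react to it.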

Amortized cost and the caching pattern

Some frameworks cache the mapping from a first AI-directed run, then replay those cached actions deterministically on subsequent passes. Zero tokens, sub-100ms latency. If a cached action fails because the page changed, the system re-engages the model to figure out the new mapping, caches that, and carries on. For workflows running hourly across hundreds of pages, the amortized cost per task drops to nearly nothing after the first pass.
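
A minimal sketch of that cache-and-replay loop, not tied to any particular framework: `ActionCache`, `resolve_with_model`, and `execute` are hypothetical names, and the "page" is simulated with a dictionary so the example runs on its own.

```python
from typing import Callable, Dict


class ActionCache:
    """Replay cached actions; call the model only on a miss or a stale entry."""

    def __init__(
        self,
        resolve_with_model: Callable[[str], str],  # expensive: intent -> action
        execute: Callable[[str], bool],            # cheap: True if action worked
    ) -> None:
        self._resolve = resolve_with_model
        self._execute = execute
        self._cache: Dict[str, str] = {}
        self.model_calls = 0

    def run(self, intent: str) -> str:
        cached = self._cache.get(intent)
        if cached is not None and self._execute(cached):
            return cached  # deterministic replay, zero model calls
        # Cache miss, or the cached action failed because the page changed.
        self.model_calls += 1
        action = self._resolve(intent)
        if not self._execute(action):
            self._cache.pop(intent, None)
            raise RuntimeError(f"model-resolved action failed for {intent!r}")
        self._cache[intent] = action
        return action


# Simulated page: only one selector is valid at any given time.
page = {"valid": "#buy-v1"}
cache = ActionCache(
    resolve_with_model=lambda intent: page["valid"],
    execute=lambda action: action == page["valid"],
)

cache.run("click buy")    # first pass: model call, result cached
cache.run("click buy")    # replay from cache
page["valid"] = "#buy-v2" # the page changes underneath the workflow
cache.run("click buy")    # cached action fails, model is re-engaged
print(cache.model_calls)  # 2
```

Keying the cache on the intent rather than on the selector is the design choice that lets a stale entry be re-resolved without rewriting the workflow.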

The core pattern

Use model judgment where the world is ambiguous, compile to determinism where it isn't, and know which step is which.

That third part is the hardest. Often you don't know whether a step will be ambiguous until the first run actually encounters it. The caching pattern is one honest answer: you learn by running, then compile what you learned. The first pass is exploration. Everything after is replay, until the world changes again.

Selectors will break eventually. The intent behind them is more durable. A system that lets you make the intelligence choice per-step, rather than committing to one philosophy for the whole workflow, is the one that holds up when the page changes underneath you.

Things to follow up on...

DOM distillation without vision: A pure-text approach using DOM distillation and "skill harvesting" (remembering successful action patterns) scored 73.1% on WebVoyager without any vision model, compared to 89.1% for hybrid DOM-plus-vision frameworks — a narrower gap than you might expect.
Playwright MCP's accessibility-first design: Microsoft's Playwright MCP exposes a headless browser controlled via accessibility tree snapshots instead of raw screenshots, letting LLMs interact with pages through structured semantic data rather than pixel-based input.
Selector maintenance costs quantified: Practitioner reports suggest Playwright scripts require 15–25% selector fixes within 30 days of deployment on live sites, with some teams spending 40–60% of testing time on maintenance rather than writing new automations.
LLM vs. code-driven orchestration: A useful pattern emerging in production systems pairs code-driven outer workflows (known business processes, predictable routing) with LLM-driven inner execution where model reasoning actually adds value at localized scope.