Playwright's CLI mode uses about 27,000 tokens for a typical browser automation task. Its MCP mode uses roughly 114,000, a difference of about 4x. That gap is a side effect of a design fork over where browser state lives between steps.
Both modes produce the same thing: a YAML serialization of the page's accessibility tree, with element references the agent uses to click, type, or navigate. What differs is where that serialization goes next.
In MCP mode, the snapshot streams into the LLM's context window. Every interaction adds a new one, and old snapshots don't get cleared. By step 12 of a workflow, the conversation is carrying accessibility trees from pages the agent left behind ten steps ago. One developer running a 15-test suite watched six tests consume 151,000 of 200,000 available tokens. A single browser_take_screenshot call accounted for 232,000 tokens by itself, and the session survived only because the image data got truncated before it landed.
In CLI mode, the snapshot gets written to a file on disk. The agent reads it when needed. Step 50 costs roughly what step 5 costs, because page states aren't accumulating in working memory. Microsoft's README is direct about the split:
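A back-of-the-envelope model makes the accumulation concrete. The sketch below assumes every snapshot is a fixed 2,000 tokens (a made-up figure), that in-context mode keeps every past snapshot in the window, and that on-disk mode has the agent read exactly one snapshot per step. It ignores reasoning and tool-call tokens, so it illustrates the shape of the curve and does not reproduce the measured numbers above.

```python
def snapshot_tokens_in_context(steps: int, snapshot_tokens: int, accumulate: bool) -> list[int]:
    """Snapshot tokens resident in the context window at each step.

    accumulate=True models snapshots piling up in the conversation.
    accumulate=False models reading only the latest snapshot from disk.
    """
    if accumulate:
        # Step n carries the snapshots from steps 1..n.
        return [snapshot_tokens * step for step in range(1, steps + 1)]
    # Only the current snapshot is ever in context.
    return [snapshot_tokens] * steps


SNAPSHOT_TOKENS = 2_000  # hypothetical size of one accessibility-tree snapshot

in_context = snapshot_tokens_in_context(50, SNAPSHOT_TOKENS, accumulate=True)
on_disk = snapshot_tokens_in_context(50, SNAPSHOT_TOKENS, accumulate=False)

# Compare step 5 with step 50 (list indices are zero-based).
print("in context:", in_context[4], "->", in_context[49])
print("on disk:   ", on_disk[4], "->", on_disk[49])
```

Under these assumptions the in-context load grows linearly with step count, while the on-disk load stays flat.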
"CLI for coding agents that need to balance browser automation with large codebases; MCP for exploratory automation where maintaining continuous browser context outweighs token cost concerns."
The recommendation follows from what the client can do. CLI requires filesystem access, which means it works with Claude Code, GitHub Copilot, and Cursor. Sandboxed chat interfaces like Claude Desktop can't write to disk, so they use MCP. The agent doesn't get to choose where state lives. The client's architecture already made that decision.
The cost story is clear enough. What happens after the session ends is where the two modes diverge further.
CLI snapshots are files. Files can be diffed. The SKILL.md in Playwright's repository explicitly teaches agents this workflow: save a snapshot before an action, save another after, run diff. Nobody set out to design "diffable browser state." It's what falls out when state is a file. The same logic extends to CLI's show dashboard for live observation and its tracing with full DOM snapshots and network logs. These capabilities emerge naturally from the decision to externalize state. You can verify, independently of the agent's own report, whether a click actually changed the page.
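The before-and-after check can be sketched with Python's standard difflib. The snapshot text here is invented for illustration and only loosely resembles a real accessibility-tree snapshot; in practice the two strings would be read from the files the agent saved, and the file names are placeholders.

```python
import difflib


def diff_snapshots(before: str, after: str) -> list[str]:
    """Return a unified diff between two snapshot texts (empty if identical)."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before.yml",  # placeholder names for the saved snapshots
            tofile="after.yml",
            lineterm="",
        )
    )


# Hypothetical snapshots taken before and after clicking "Add to cart".
before = """\
- heading "Cart" [ref=e1]
- button "Add to cart" [ref=e2]
- text: 0 items
"""
after = """\
- heading "Cart" [ref=e1]
- button "Add to cart" [ref=e2]
- text: 1 item
"""

changes = diff_snapshots(before, after)
print("\n".join(changes) if changes else "click had no visible effect")
```

An empty diff is the interesting case: it means the page did not change, whatever the agent reported.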
MCP doesn't produce artifacts. Browser state exists only inside the conversation history, tangled with the agent's reasoning turns. When something goes wrong, you can't separate what the agent saw from what the agent concluded. Observation and inference live in the same stream, so debugging means reading backward through accumulated context trying to reconstruct which snapshot the agent was actually reasoning against when it made the wrong call. Practitioners describe diagnosing these failures as chasing "flaky" behavior. The agent is reasoning against stale state that never got cleared, and the flakiness is a symptom of that accumulation.
CLI has its own failure mode, though. Because snapshots sit on disk instead of arriving automatically in context, the agent has to decide when to look and what to look for. MCP gives the agent everything, including data it no longer needs, but also continuity it doesn't have to actively maintain. For workflows requiring deep, iterative reasoning about page structure, that automatic injection is more natural. CLI trades accumulation risk for a different kind of brittleness: the agent not reading what it needs to.
The token gap that exists at step 7 grows to 10x by step 15. The gap between a session that leaves behind inspectable artifacts and one that leaves behind a conversation log widens on a similar curve. CLI stores what the agent saw and what the agent thought in different places. MCP keeps them in one stream, inseparable after the fact.
Externalizing state saves tokens. It also preserves the ability to audit what happened, a benefit that shows up in a different column on a longer timeline.
Things to follow up on...
- Agentic traffic at scale: HUMAN Security's April 2026 report found that agentic browser traffic grew 8,000% year-over-year, which means the infrastructure choices underneath browser agents are scaling into production whether or not the design questions are settled.
- Pass^k and session reliability: The tau-bench benchmark introduced pass^k to measure agent consistency across repeated trials, and the exponential decay it surfaces is exactly the kind of reliability question that session-length context accumulation makes worse.
- Context as working memory: Mem0's 2026 benchmarks showed a two-layer memory architecture scoring 91.6% accuracy at 4x fewer tokens than full-context baselines, which suggests the disk-vs-context split in Playwright may be an early instance of a broader architectural pattern.
- WebDriver BiDi's real-time model: The W3C published a Working Draft for WebDriver BiDi in June 2026, defining bidirectional browser communication over WebSockets that could eventually change how agents receive state updates altogether.
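
For the pass^k item above, a minimal sketch of the metric, assuming the usual combinatorial estimator: given c successes in n recorded trials of a task, the chance that k trials drawn without replacement all succeed is C(c, k) / C(n, k). The 8-of-10 success count is invented for illustration.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k trials, drawn from `trials` recorded runs, all succeed."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    # comb(successes, k) is 0 when k > successes, so the estimate drops to 0.0.
    return comb(successes, k) / comb(trials, k)


# Hypothetical agent that succeeded on 8 of 10 runs of the same task.
for k in (1, 2, 4, 8):
    print(f"pass^{k} = {pass_hat_k(8, 10, k):.3f}")
```

An agent that looks 80% reliable on a single run looks far worse when asked to succeed every time, which is the decay the metric is meant to expose.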

