I've watched production web automations fail when sites redesign their layouts. The agent can still parse the HTML—all the right elements are there—but it doesn't know where to look anymore. A login button moves from top-right to bottom-left. A confirmation dialog appears in a different position. The visual context that helped the system understand "this is important" or "wait for this before proceeding" disappears when you're working with a text-only DOM.
Removing visual information solves the cost and latency problems. But the spatial cues that help agents adapt when sites change disappear along with it.
Satya Nitta is making this trade with Agent-E. Strip HTML pages down to text elements, discard visual information, and see if you can match or beat vision-based approaches while solving the cost and latency problems that make multimodal agents expensive to run.
Nitta spent five years at IBM Research building AI tutors before concluding that AI's value comes from automating work humans don't want to do, not replacing expertise. After co-founding Merlyn Mind and leaving in 2024, he launched Emergence AI to test that thesis through web automation.
Raw HTML DOMs contain thousands of tokens—expensive LLM calls, slow response times. DOM distillation reduces that payload by orders of magnitude, keeping only interactive elements and relevant text. Agent-E uses a two-tier structure: a planner breaks tasks into subtasks while a browser navigation agent handles execution, switching between three DOM representations depending on what the task requires.
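Agent-E's actual distillation logic lives in its open-source repo; I haven't reproduced it here. What follows is a minimal sketch of the idea, using BeautifulSoup, with a tag list and element schema I made up:

```python
from bs4 import BeautifulSoup

INTERACTIVE_TAGS = ["a", "button", "input", "select", "textarea"]

def distill_dom(html: str, max_text_len: int = 80) -> list[dict]:
    """Reduce a raw HTML page to a compact list of actionable elements.

    Keeps only tags an agent can interact with, plus enough text and
    attributes to identify each one. A production distiller would also
    handle ARIA roles, shadow DOM, and iframes; this is just the core idea.
    """
    soup = BeautifulSoup(html, "html.parser")
    distilled = []
    for tag in soup.find_all(INTERACTIVE_TAGS):
        distilled.append({
            "tag": tag.name,
            "text": tag.get_text(strip=True)[:max_text_len],
            "attrs": {k: tag.get(k)
                      for k in ("id", "name", "type", "href", "aria-label")
                      if tag.get(k)},
        })
    return distilled

html = ('<html><body><nav><a href="/deals">Deals</a></nav>'
        '<input name="q" type="search" aria-label="Search">'
        '<button id="submit">Go</button></body></html>')
print(distill_dom(html))
# A page that would cost thousands of tokens collapses to a short list.
```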
On the WebVoyager benchmark—643 tasks across 15 live websites—Agent-E achieved a 73.2% success rate using only text, beating multimodal agents by 10-30%. The other optimization is skill harvesting: agents analyze successful executions and extract reusable patterns. One example shows a five-step Amazon search reduced to a single LLM call after harvesting a specialized search skill.
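The paper doesn't spell out the harvested-skill format, so this is a guess at the shape: a recorded action sequence with placeholders bound at call time, replayed without per-step planning. The step schema and selectors below are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class HarvestedSkill:
    """A reusable action sequence extracted from a successful run."""
    name: str
    steps: list[dict] = field(default_factory=list)

    def replay(self, execute, **params):
        # Bind parameters like {query}, then run every step without
        # asking the LLM to plan each one individually.
        for step in self.steps:
            bound = {k: (v.format(**params) if isinstance(v, str) else v)
                     for k, v in step.items()}
            execute(bound)

# Hypothetical skill harvested from a five-step search trace.
amazon_search = HarvestedSkill("amazon_search", steps=[
    {"op": "goto", "url": "https://www.amazon.com"},
    {"op": "type", "selector": "#twotabsearchtextbox", "text": "{query}"},
    {"op": "click", "selector": "#nav-search-submit-button"},
])

# One call replaces the planner/navigator loop for this known pattern.
amazon_search.replay(lambda action: print("execute:", action),
                     query="mechanical keyboard")
```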
When Optimization Creates Brittleness
Skill harvesting produces patterns that work until sites change. I've maintained pattern libraries extracted from successful runs. The maintenance burden isn't obvious until you're running at scale. Patterns break when sites update, and you don't always know which patterns are stale until they fail in production. A harvested skill that worked perfectly last month stops working because Amazon changed how search results render. You're debugging why success rates dropped without clear signals about what changed.
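Neither the paper nor the README describes how staleness gets detected, so you end up building it yourself. A sketch of the minimum viable version, with thresholds I picked arbitrarily:

```python
from collections import defaultdict, deque

class SkillHealthTracker:
    """Retire harvested skills whose recent success rate has collapsed."""

    def __init__(self, window: int = 20, min_success_rate: float = 0.8):
        self.min_success_rate = min_success_rate
        # Sliding window of recent outcomes per skill.
        self.outcomes = defaultdict(lambda: deque(maxlen=window))

    def record(self, skill: str, succeeded: bool) -> None:
        self.outcomes[skill].append(succeeded)

    def is_stale(self, skill: str) -> bool:
        runs = self.outcomes[skill]
        if len(runs) < 5:  # not enough signal to judge yet
            return False
        return sum(runs) / len(runs) < self.min_success_rate

tracker = SkillHealthTracker()
for ok in [True] * 10 + [False] * 6:  # e.g. Amazon changes its results page
    tracker.record("amazon_search", ok)
print(tracker.is_stale("amazon_search"))  # True: fall back to full planning
```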
Removing visual context saves tokens but strips away the signals that help agents adapt. Button placement changes. Loading indicators appear differently. Visual confirmation that an action completed. All of it gone when you're working with a text-only DOM. The system can still parse the HTML, but it loses the spatial and visual cues that make ambiguous situations resolvable.
Agent-E's technical paper acknowledges the system "is not currently designed considering a dynamically changing environment." The GitHub README is more direct:
"Tests may not pass consistently due to changes in live websites."
The gap between a 73.2% benchmark score and sustained reliability on sites that update daily takes operational work to bridge.
Emergence raised $97.2 million in June 2024 and launched their CRAFT platform in June 2025. They've announced partnerships with Samsung and National Instruments. But there are no named customers using Agent-E in production, and no scale metrics showing how frequently harvested skills break or what the maintenance burden looks like at enterprise scale.
Text-only DOM might beat vision-based approaches while staying cheaper and faster. Sustained reliability when websites change constantly—the scenario Agent-E acknowledges it wasn't designed for—remains to be demonstrated.
Things to follow up on...
- WebVoyager benchmark design: Agent-E was tested on WebVoyager's 643 tasks across 15 live websites, which the paper describes as more representative than static self-hosted benchmarks like WebArena.
- Nitta's IBM Watson experience: After five years attempting to build AI tutors at IBM Research, Nitta concluded that "we'll have flying cars before we will have AI tutors" because teaching is a deeply human process AI cannot meaningfully replicate.
- Flexible DOM representations: Agent-E switches between three DOM representations—text_only, input_fields, and all_fields—with the browser navigation agent selecting which representation fits the current subtask requirements; a sketch of the idea follows this list.
- CRAFT platform launch: Emergence's enterprise platform launched in June 2025, positioning itself to "supercharge operational efficiency" across financial services, e-commerce, semiconductors, and supply chain sectors.
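On the DOM representations item above: the three names come from Agent-E's documentation, but the filtering logic below is my reconstruction of the intent, not the project's implementation.

```python
from bs4 import BeautifulSoup

def get_dom(html: str, representation: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    if representation == "text_only":
        # Cheapest view: visible text only, for reading/verification subtasks.
        return soup.get_text(separator=" ", strip=True)
    if representation == "input_fields":
        # Actionable elements only, for clicking and form-filling subtasks.
        fields = soup.find_all(["input", "button", "select", "textarea", "a"])
        return "\n".join(str(f) for f in fields)
    if representation == "all_fields":
        # Full structure: the expensive fallback when cheaper views are ambiguous.
        return str(soup)
    raise ValueError(f"unknown representation: {representation}")

page = '<p>Results: 3 items</p><input name="q"><button>Search</button>'
print(get_dom(page, "text_only"))     # -> Results: 3 items
print(get_dom(page, "input_fields"))  # -> just the two actionable elements
```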

