Every pharmacy in America downloads invoices from the same handful of distributors. Every county title office publishes records through a web portal. Every insurance carrier offers an online quoting tool. None of them work the same way.
The fields have different names. The navigation follows different logic. Drop-downs masquerade as text boxes, checkboxes arrive pre-checked, and search bars turn out to be buttons. This is the web's tacit complexity. No API specification captures it, because there is no API. No script handles it reliably, because the script assumes a stable structure and the structure shifts between sites, between pages, sometimes between visits. The complexity lives in the surface itself, in the gap between what an element looks like and what it actually does.
A person clicks through this kind of thing without pausing. Automation hits the same page and stalls.
So why did this particular problem become the founding bet for Skyvern, the browser automation company Suchintan Singh and Shuchang Zheng started out of Y Combinator?
Where the Problem Became Visible
Building ML pipelines for marketplace search requires data. Lots of it. Product characteristics, user behavior, store-level preferences. At Faire, tailoring search to each retailer meant understanding what they actually stocked. Much of that data lived behind web interfaces nobody had built an API for. Singh spent years building ML platforms at Faire and Gopuff, and the recurring lesson held: model capability kept improving every quarter, but clean inputs stayed scarce. The bottleneck was always upstream.
Zheng arrived at the same gap from the reliability side. At Lyft, he'd built testing tools for over a thousand engineers to keep systems from failing during peak events. At Patreon, he'd scaled payment infrastructure to eliminate a three-day monthly code freeze for a hundred engineers. Both founders understood what brittle, browser-dependent systems cost at engineering scale. The data-access problem and the reliability problem converge when your automation breaks silently because a portal changed its layout overnight.
Skyvern was their third product. They'd tried an onboarding tool and an ML platform for marketplaces before arriving here. The web kept being the thing that got in the way.
Reading Instead of Mapping
Skyvern's response to the orientation problem: instead of relying on pre-mapped selectors and DOM structures that break when a site changes its layout, the agent parses the screen in real time. Computer vision identifies the elements on the page. Language models interpret what each element is asking for and decide what to do with it. The agent can operate on sites it has never encountered before, because it reads the page as it appears rather than following a map someone drew last month.
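The contrast can be sketched in miniature. Everything below is a hypothetical illustration, not Skyvern's code: the page dicts, the `scripted_find` and `agent_find` helpers, and the keyword-overlap matcher (which stands in for the language model that does this interpretation in the real system).

```python
# A scripted bot pins its hopes on a selector recorded at authoring time.
SCRIPTED_SELECTOR = "#quote-form input[name='zip']"

def scripted_find(page_selectors):
    """Succeeds only if the exact recorded selector still exists."""
    return SCRIPTED_SELECTOR if SCRIPTED_SELECTOR in page_selectors else None

# A reading-based agent matches elements by what they appear to mean,
# so a renamed field or reshuffled DOM doesn't strand it.
def agent_find(elements, intent):
    """Pick the element whose visible label best matches the intent.
    Keyword overlap is a toy stand-in for a language model's judgment."""
    intent_words = set(intent.lower().split())
    best, best_score = None, 0
    for el in elements:
        score = len(intent_words & set(el["label"].lower().split()))
        if score > best_score:
            best, best_score = el, score
    return best

# The same page after a redesign: selectors changed, visible labels didn't.
redesigned_page = {
    "selectors": ["#q2-form input[name='postal_code']"],
    "elements": [
        {"label": "ZIP / Postal Code",
         "selector": "#q2-form input[name='postal_code']"},
        {"label": "Start My Quote",
         "selector": "#q2-form button.submit"},
    ],
}

print(scripted_find(redesigned_page["selectors"]))  # the script is lost
print(agent_find(redesigned_page["elements"], "zip code"))
```

The scripted lookup returns nothing once the attribute name changes; the label-matching agent still lands on the right field, because the human-facing surface of the page moved less than its markup did.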
The small moments of contextual inference show the approach most clearly. When a site asks "Were you eligible to drive at 21?" the agent needs to work out the answer from a license date and a birth year. When Delaware's government portal goes offline at night or requires a fax to proceed, the system needs enough awareness to handle the situation gracefully rather than failing silently. A person encountering a confusing form brings tolerance for ambiguity, the ability to recognize when something doesn't make sense, a willingness to infer. Skyvern's bet is that vision and language models can approximate enough of that tolerance to function where scripts cannot.
Whether this holds against the full adversarial messiness of the live web is genuinely open. Sites shift. Sites detect. Sites deceive. The orientation problem has a harder version: navigating territory that is actively trying to disorient you. Nobody has cleanly solved that. But the founding insight is specific and worth sitting with. Orientation is the bottleneck for browser automation, and Singh's years feeding ML models from the messy surface of the web made that particular problem impossible to unsee.
Things to follow up on...
- The compounding accuracy problem: If an agent achieves 85% accuracy per step, a 10-step workflow succeeds roughly 20% of the time, which helps explain why only 10% of organizations successfully scale agent pilots to production.
- Amazon's production lessons: AWS published a detailed account of what it takes to keep agents reliable at scale, including the burden of manually defining tool schemas for hundreds of APIs and the necessity of human-in-the-loop evaluation.
- Agent observability as unsolved layer: Most teams trace whether an agent ran, not whether it behaved correctly, and the orchestration challenges in multi-agent systems are compounding faster than the tooling to diagnose them.
- MCP's production growing pains: The official 2026 roadmap from MCP's lead maintainer identifies structural gaps including stateful sessions fighting load balancers and no standard way for registries to discover what a server does without connecting to it.
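The compounding-accuracy claim in the first note above falls out of simple multiplication: per-step success rates compound across a workflow.

```python
# 85% accuracy per step over 10 independent steps, as in the note above.
per_step = 0.85
steps = 10
workflow_success = per_step ** steps
print(f"{workflow_success:.1%}")  # roughly 20%

# Inverting it: for a 10-step workflow to succeed 90% of the time,
# each step would need to clear this bar.
required_per_step = 0.90 ** (1 / steps)
print(f"{required_per_step:.1%}")  # about 99%
```

The inversion is the sobering part: production-grade end-to-end reliability demands per-step accuracy far above what raw model benchmarks suggest, assuming steps fail independently.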

