When Documentation Becomes More Real Than the System

A workflow fails on a specific site. The team checks the documented selector pattern—unchanged for six months. They debug why the selector isn't working: testing syntax variations, reviewing recent code changes, checking if the site's anti-bot measures changed. Twenty minutes in, someone loads the actual site. The HTML structure changed three weeks ago. The workflow has been failing silently since then.

Twenty minutes debugging a selector pattern that stopped working three weeks ago. They knew sites change. But the documentation said it should work.

When web agent systems reach thousands of workflows running daily across hundreds of sites, teams build comprehensive documentation. Every workflow gets documented: authentication sequences, selector patterns, error handling logic, retry strategies. The documentation becomes detailed enough to serve as the canonical reference for how systems should behave.

Operator behavior shifts. They stop verifying what systems actually do. They trust what documentation says they do.

Rigorous documentation disconnects teams from system behavior. Comprehensive documentation creates a reflex that changes how operators approach failures. If the documentation says the selector should work, the problem must be in the implementation. The documented pattern becomes more real than the actual site.

Traditional web scraping requires constant maintenance when sites change HTML structure. Teams know this. But comprehensive documentation inverts the operational instinct. They debug why the documented pattern isn't working. The actual site's current state becomes an afterthought.
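One way to put the live site back in front of the documentation is to test the documented pattern against the current page before debugging anything else. The sketch below is a minimal illustration, not a description of any particular team's tooling. It assumes the documented selector can be reduced to a tag name plus a CSS class, and that the page HTML has already been fetched by some other means. The names count_matches and check_documented_selector are hypothetical.

```python
from html.parser import HTMLParser


class _TagClassCounter(HTMLParser):
    """Counts start tags that have a given tag name and CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # The class attribute may be absent or valueless, hence the fallback.
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_matches(html, tag, css_class):
    """Return how many elements in html match the documented tag and class."""
    parser = _TagClassCounter(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.count


def check_documented_selector(live_html, tag, css_class):
    """Hypothetical first step of triage: does the pattern match the live page?"""
    n = count_matches(live_html, tag, css_class)
    if n == 0:
        return "stale: documented pattern matches nothing on the live page"
    return f"ok: {n} match(es) on the live page"


if __name__ == "__main__":
    # Illustrative HTML only; a real check would use the page as fetched today.
    page_today = '<span class="amount">10</span>'
    print(check_documented_selector(page_today, "div", "price"))
```

Run as the first step of triage, or on a schedule, a check like this turns "the documentation says it should work" into a question about the page as it exists today.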

The same reflex shows up outside web automation. A cache server fails. Operators follow their runbook: fail over to the backup, apply the documented configuration. The backup comes online, then immediately fails. They reapply the configuration. Same result.

Three hours into the incident, someone checks the actual running configuration on the primary server. It doesn't match the documentation. The documented failover config was wrong. The team had spent three hours debugging why a supposedly correct configuration wasn't working.
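The comparison that ended this incident can be made mechanical. The following is a rough sketch under stated assumptions: both the documented and the running configuration are available as flat dictionaries with string keys, and the setting names shown are invented for illustration. The function name config_drift is hypothetical.

```python
_MISSING = object()  # sentinel so a key that is absent differs from a None value


def config_drift(documented, running):
    """Return {key: (documented_value, running_value)} for every mismatch.

    A key present on only one side is reported with None on the other side.
    """
    drift = {}
    for key in sorted(documented.keys() | running.keys()):
        doc_val = documented.get(key, _MISSING)
        run_val = running.get(key, _MISSING)
        if doc_val != run_val:
            drift[key] = (
                None if doc_val is _MISSING else doc_val,
                None if run_val is _MISSING else run_val,
            )
    return drift


if __name__ == "__main__":
    # Hypothetical settings; real keys depend on the cache software in use.
    documented = {"max_memory_mb": 2048, "eviction_policy": "lru"}
    running = {"max_memory_mb": 4096, "eviction_policy": "lru", "replica_mode": "on"}
    for key, (doc_val, run_val) in config_drift(documented, running).items():
        print(f"{key}: documented={doc_val!r} running={run_val!r}")
```

The design choice here is to treat the running system as the reference and the runbook as the thing being checked, which is the reverse of what the operators did for three hours.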

The Documentation-First Trap

The incident postmortem concluded: documentation was outdated, assign a documentation owner. The team treated documentation as the system to fix—a thing to maintain rather than a reflection of something else.

At scale, teams spend more time maintaining documentation accuracy than they would spend observing actual system behavior. Configuration changes get documented. Sites change their structure. Authentication flows evolve. The actual system keeps running—sometimes correctly, sometimes incorrectly—while the documented system remains frozen at the moment it was written.

Ten thousand workflows, each with documented patterns. Sites change daily. Teams can't verify every workflow against every site change, so the documentation becomes operational reality: the thing teams trust when something breaks. Actual system behavior becomes secondary.

Comprehensive documentation creates operational blind spots precisely because it's comprehensive. Documentation drifts, and everyone expects that. Yet teams trust detailed documentation more than they trust observation. The documented workflow persists long after the underlying sites have changed their behavior. Teams debug the gap between documentation and reality, treating the documented state as the baseline that reality should match.

The documentation is there, detailed and authoritative. So they keep working to close that gap.

Things to follow up on...

Runbook accessibility during incidents: During production incidents, runbooks stored in Confluence or Notion might as well not exist because on-call engineers toggle between multiple tools under extreme stress before finding documentation that's often outdated or missing context.
Living documentation's conceptual challenges: While living documentation promises to stay current by auto-generating from code, the challenges are more conceptual and organizational than technical in nature—teams struggle with information overload, quality control, and maintaining consistency across multiple authors.
Configuration drift in automated environments: The fast pace of automated cloud provisioning causes configuration drift issues to accumulate quickly simply because changes happen rapidly, with ad hoc fixes often going undocumented without clear change management strategies.
Emergency fixes creating documentation gaps: When there's an outage in production, engineers make emergency configuration changes to fix issues as quickly as possible, but if those changes aren't documented back into source control, configuration drift occurs and the next responder might wipe out a fix someone made earlier.
