During a site migration, our web agents encounter something strange: different parts of our infrastructure see different versions of reality. DNS resolvers return different IP addresses for the same domain. Requests time out. Authentication fails. A 1983 architectural choice, working exactly as designed.
Paul Mockapetris created the Domain Name System to replace the centralized HOSTS.TXT file that couldn't scale with the growing internet. Root servers needed protection—they couldn't handle every query from every machine. So he built in aggressive caching. RFC 1034 recommended TTL values "on the order of days" for typical hosts. The common default became 86400 seconds. Twenty-four hours.
For a network of research institutions where hosts rarely moved and IP addresses rarely changed, this made perfect sense. DNS changes were rare, coordinated events. An update could wait a day to propagate. The caching protected infrastructure while serving a world where change itself was exceptional.
That world is gone.
The number remains.
Truth on a Timer
At TinyFish, we operate web agents that need to work reliably, every time, across thousands of sites. That 1983 decision shows up constantly: a distributed system where truth is relative and time-delayed.
A site updates DNS records—migrating domains, switching CDNs, updating IP addresses. Those changes don't propagate instantly. They ripple outward as cached records expire and refresh. Different DNS servers cache information at different times, creating windows where different resolvers return different answers. Propagation typically takes 24 to 48 hours.
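The divergence is easy to observe directly: ask several resolvers the same question and compare the answers. A minimal sketch, assuming the third-party dnspython package; the public resolver addresses and example.com are stand-ins, not a description of our production monitoring:

```python
# Sketch: compare the answers several public resolvers currently serve
# for one name. Assumes dnspython (pip install dnspython); the resolver
# IPs and the domain are placeholders.
import dns.resolver

RESOLVERS = {
    "google": "8.8.8.8",
    "cloudflare": "1.1.1.1",
    "quad9": "9.9.9.9",
}

def snapshot(name: str) -> dict[str, frozenset[str]]:
    """Return the set of A records each resolver currently serves for `name`."""
    views = {}
    for label, ip in RESOLVERS.items():
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [ip]
        answer = resolver.resolve(name, "A")
        views[label] = frozenset(record.address for record in answer)
    return views

views = snapshot("example.com")
if len(set(views.values())) > 1:
    print("Resolvers disagree (propagation window):", views)
else:
    print("All resolvers agree:", views)
```

Outside a change, the sets match. Run the same check shortly after a record is updated and the answers can differ for roughly one TTL, which is exactly the window the scenarios below fall into.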
A travel site migrates from one CDN to another. They update DNS records at 3 PM. For the next 24 hours, our agents in different regions see different infrastructure. Some connect to the old CDN, others to the new one. Session cookies issued by the old infrastructure don't work on the new infrastructure. Requests that should succeed fail unpredictably—not because anything broke, but because DNS caching means different parts of the system are operating in different moments of time.
When DNS servers disagree during updates, the response depends on which server was queried. That response becomes truth. No reconciliation mechanism exists. For web automation running at scale, this breaks assumptions about what "working" even means.
The Inversion
The original architecture assumed changes were rare and coordination was possible. You'd plan a migration, notify stakeholders, wait for propagation, verify everything worked. The system optimized for steady-state performance at the cost of slow updates.
Modern web operations turned this completely around.
Sites run continuous deployments. CDNs switch traffic dynamically. A/B tests route users to different infrastructure by region. DNS changes went from rare, coordinated events to constant, distributed updates happening across thousands of sites simultaneously.
The caching that protected root servers now creates problems at every layer. Operating web agents that need reliable access across thousands of sites, we can't assume DNS resolution returns consistent answers. We build infrastructure that treats DNS inconsistency as something to expect, not an edge case (a sketch of the retry and fallback pieces follows the list):
- Monitoring resolution across multiple locations
- Detecting when different resolvers diverge
- Implementing retry logic that accounts for propagation delays
- Maintaining fallback strategies when DNS returns stale data
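A simplified sketch of the last two items: retry resolution with backoff while resolvers are still settling, and fall back to the last answer that actually worked when nothing usable comes back. It assumes the dnspython package; the delays, attempt counts, and the in-memory last-known-good store are illustrative, not our production implementation.

```python
# Sketch: retry DNS resolution with backoff during a propagation window,
# and fall back to the last set of addresses that worked.
# Assumes dnspython; names, delays, and the cache dict are illustrative.
import time

import dns.exception
import dns.resolver

LAST_KNOWN_GOOD: dict[str, frozenset[str]] = {}

def resolve_with_fallback(name: str, attempts: int = 4, base_delay: float = 2.0) -> frozenset[str]:
    for attempt in range(attempts):
        try:
            answer = dns.resolver.resolve(name, "A")
            addresses = frozenset(record.address for record in answer)
            LAST_KNOWN_GOOD[name] = addresses  # remember what worked
            return addresses
        except (dns.resolver.NoNameservers, dns.exception.Timeout):
            # Resolvers are timing out or answering inconsistently; wait out
            # part of the propagation window before asking again.
            time.sleep(base_delay * 2 ** attempt)
    # Every attempt failed: serve the stale-but-working answer if we have one.
    if name in LAST_KNOWN_GOOD:
        return LAST_KNOWN_GOOD[name]
    raise RuntimeError(f"no usable DNS answer for {name}")
```

The design choice worth noting is the explicit last-known-good store: during a propagation window, a slightly stale address that still serves traffic is usually better than no answer at all.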
The Invisible Default
Most teams encountering DNS propagation delays don't connect them to that 1983 architectural choice. It feels like an inevitable property of the internet—just how DNS works—rather than a design decision optimized for constraints that no longer exist.
A site migration causes intermittent failures for 24 hours. Teams troubleshoot their code, their infrastructure, their CDN configuration. They rarely think: this is Mockapetris's caching decision protecting root servers on a network where changes were rare and coordinated.
RFC 1034 explicitly stated the rationale: "the realities of Internet performance suggest that these times should be on the order of days." Those realities were server load and bandwidth on a research network. The performance characteristics that mattered then are not the performance characteristics that matter now.
The web outgrew those constraints decades ago. Contemporary practices commonly use 300-second TTLs for records needing fast updates. Yet the architecture remains: a distributed caching system where updates propagate slowly, truth is relative, and different observers see different versions of the same domain for hours.
When you're building web automation at scale, this becomes infrastructure you architect around daily. The elegant solution that protected root servers became the challenge we navigate now—one twenty-four-hour default at a time.
Things to follow up on...
- Pre-migration TTL reduction: Administrators commonly lower TTL values to 5-10 minutes before a DNS change to shrink the propagation window, though the reduced TTL has to be published at least one full old TTL ahead of the change so caches pick up the shorter value in time.
- ISP caching behavior: Some internet service providers ignore TTL values and keep serving cached DNS records after they expire, which can stretch propagation well beyond what the TTL suggests.
- Anycast monitoring challenges: DNS infrastructure relies heavily on anycast routing, so organizations need to monitor from widely dispersed locations matching their customer base rather than from a few centralized monitoring points.
- Resolver TTL overrides: The DNS protocol's 32-bit TTL field allows caching for up to sixty-eight years or no caching at all, and some resolvers override the TTL values set by authoritative servers, adding more unpredictability to propagation timelines.
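A quick arithmetic check of the sixty-eight-year figure, assuming the 2^31 - 1 second cap that RFC 2181 places on usable values in the 32-bit TTL field:

```python
# The TTL field is 32 bits; RFC 2181 treats values with the top bit set
# as zero, so the largest usable TTL is 2**31 - 1 seconds.
max_ttl_seconds = 2**31 - 1
seconds_per_year = 365.25 * 24 * 3600
print(max_ttl_seconds / seconds_per_year)  # ~68.05 years
```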

