When we're running pricing checks across thousands of hotel sites, monitoring sometimes shows a pattern that initially looks impossible: error rates doubling every 30 seconds—4%, 8%, 16%, 32%—while every individual layer reports normal retry behavior. The frontend shows three retry attempts per failure. The API gateway shows its own retry logic executing correctly. The backend service shows expected retry patterns. Yet somehow, one failed authentication becomes 64 database attempts before any system recognizes something's wrong.
Each recovery mechanism operates independently, unaware of the others—systems designed to improve resilience amplify failures exponentially instead.
The math reveals the coordination problem. Each layer adds its own retry logic:
| Layer | Retries | Attempts per request | Cumulative attempts |
|---|---|---|---|
| Frontend | 3 | 4 | 4 |
| API Gateway | 3 | 4 | 4 × 4 = 16 |
| Backend Service | 3 | 4 | 16 × 4 = 64 |
Three retries at the frontend layer means four total attempts. The API gateway adds its own three retries—turning those four attempts into sixteen. The backend service implements another retry layer—sixteen becomes sixty-four.
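To see the multiplication run, here is a minimal Python sketch. The layer wrappers and the always-failing call are hypothetical stand-ins, not our production code: three independent retry loops, each reasonable on its own, with a counter at the bottom of the stack.

```python
# Minimal sketch: three independent retry layers wrapped around one failing call.
# Layer names and the always-failing "database" are hypothetical, for illustration.

attempts = 0  # counts calls that actually reach the database

def database_call():
    global attempts
    attempts += 1
    raise RuntimeError("auth failure")  # simulate a persistent failure

def with_retries(fn, retries=3):
    """Wrap fn so each call is tried once plus `retries` more times."""
    def wrapped():
        last_exc = None
        for _ in range(retries + 1):  # original attempt + retries
            try:
                return fn()
            except RuntimeError as exc:
                last_exc = exc
        raise last_exc
    return wrapped

# Each layer adds its own retry loop, unaware of the others.
backend = with_retries(database_call)   # backend service: 4 attempts per call
gateway = with_retries(backend)         # API gateway: 4 backend calls per request
frontend = with_retries(gateway)        # frontend: 4 gateway calls per user action

try:
    frontend()
except RuntimeError:
    pass

print(attempts)  # 64
```

Every layer believes it made four attempts; only the counter at the bottom sees sixty-four.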
During the October 2025 AWS outage, this pattern manifested across millions of clients simultaneously. A DNS misconfiguration triggered retry logic at multiple layers. AWS SDKs defaulted to 3 retries with exponential backoff but no jitter—every client retried at roughly the same intervals, creating synchronized waves that saturated resolver infrastructure. Services like Lambda and API Gateway lacked circuit breakers, adding their own retry loops. A localized DNS failure amplified into a 15-hour outage affecting 110 services.
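The missing jitter is what synchronizes the waves. Here is a minimal sketch of backoff with "full jitter": instead of every client sleeping exactly 1s, 2s, 4s, each sleeps a random fraction of that window. The base and cap values are arbitrary placeholders, not SDK defaults.

```python
import random
import time

def backoff_with_full_jitter(attempt, base=0.5, cap=30.0):
    """Sleep for a random duration in [0, min(cap, base * 2**attempt)).

    Without the random draw, every client that failed at the same moment
    retries at the same moment, producing synchronized waves.
    """
    window = min(cap, base * (2 ** attempt))
    time.sleep(random.uniform(0, window))

def call_with_retries(fn, retries=3):
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise
            backoff_with_full_jitter(attempt)
```

Randomizing within the window spreads retries out over time, so clients that failed together don't all come back together.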
## Reading the Signals
Operating web agents at scale teaches you to recognize retry amplification through specific patterns:
- Packet loss appears first at edge nodes—not gradual degradation but sudden spikes
- DNS resolution failures cascade across services
- Error rates grow exponentially across independent services that shouldn't share a failure mode
- Synchronized timing as every client's retry logic triggers at similar intervals
That combination distinguishes retry amplification from capacity exhaustion or isolated failures. Capacity problems show gradual degradation as resources deplete. Isolated failures affect specific components. Retry storms show exponential growth with synchronized timing.
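One rough way to encode that signature in alerting, as a sketch rather than a production detector (the growth threshold and window count are assumptions you would tune against your own baselines):

```python
def looks_like_retry_storm(error_rates, growth_factor=1.8, min_points=3):
    """Heuristic: flag when the error rate grows near-geometrically.

    error_rates: recent per-interval error rates, oldest first.
    Capacity exhaustion tends to climb gradually; retry amplification
    compounds, so successive ratios stay close to the per-layer multiplier.
    """
    if len(error_rates) < min_points:
        return False
    ratios = [b / a for a, b in zip(error_rates, error_rates[1:]) if a > 0]
    return len(ratios) >= min_points - 1 and all(r >= growth_factor for r in ratios)

# The opening incident: 4%, 8%, 16%, 32% over successive windows.
print(looks_like_retry_storm([0.04, 0.08, 0.16, 0.32]))  # True
print(looks_like_retry_storm([0.04, 0.05, 0.06, 0.07]))  # False: gradual degradation
```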
We've learned to watch for another signal: when error rates spike but individual service metrics look normal. Each layer only sees its own retry attempts—three retries looks reasonable in isolation. The multiplication happens across layers, visible only when you're monitoring the entire stack simultaneously across thousands of sessions.
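That stack-wide view can also be checked statically, before any incident. A sketch that multiplies out per-layer retry settings against a whole-stack budget (the layer names and budget value are illustrative, not a standard):

```python
def effective_attempts(retries_per_layer):
    """Worst-case downstream attempts per user request when layers retry independently."""
    total = 1
    for retries in retries_per_layer.values():
        total *= retries + 1  # original attempt + retries at that layer
    return total

layers = {"frontend": 3, "api_gateway": 3, "backend_service": 3}
attempts = effective_attempts(layers)  # 64

# Each layer's 3 retries passes review in isolation; the product is what pages you.
ATTEMPT_BUDGET = 10  # illustrative stack-wide budget per user request
if attempts > ATTEMPT_BUDGET:
    print(f"retry amplification: {attempts} worst-case attempts exceeds budget of {ATTEMPT_BUDGET}")
```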
## The Counterintuitive Decision
The operational complexity emerges when the solution requires disabling the systems designed to provide resilience.
During the AWS outage, engineers throttled EC2 instance launches to reduce systemic strain. This meant temporarily disabling automated scaling—betting that automated recovery was amplifying the failure rather than containing it. The decision required reading monitoring patterns and reasoning backward: if retry amplification is saturating infrastructure, adding more capacity through automated scaling just adds more retry attempts.
You're making a counterintuitive decision—disabling automated recovery during a failure—based on pattern recognition developed through production experience. Get it wrong, and you've disabled the systems that could help recovery. Get it right, and you've stopped the amplification that's preventing recovery. This kind of judgment emerges from encountering these patterns repeatedly, learning to read the signals that distinguish retry amplification from other failure modes.
When we're operating thousands of concurrent browser sessions hitting authentication endpoints, we encounter these coordination problems constantly. Multiple agents retry simultaneously. Session management adds its own retry layer. Rate limiting triggers more retries. Each recovery mechanism assumes single-session behavior. A single browser works fine—thousands of browsers acting simultaneously create coordination problems that only become visible when you're running web agents at this scale.
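The mitigation we keep coming back to is making retries share state instead of deciding independently. A rough sketch of a shared retry budget across concurrent sessions, with invented numbers and names:

```python
import threading

class SharedRetryBudget:
    """Cap total retries per endpoint across all concurrent sessions.

    Sessions spend from a shared pool instead of retrying independently, so a
    widespread failure drains the pool quickly and retries stop amplifying.
    Tokens refill slowly as a fraction of successful calls.
    """

    def __init__(self, max_tokens=100, refill_per_success=0.1):
        self._tokens = max_tokens
        self._max = max_tokens
        self._refill = refill_per_success
        self._lock = threading.Lock()

    def try_acquire(self):
        with self._lock:
            if self._tokens >= 1:
                self._tokens -= 1
                return True
            return False  # budget exhausted: fail fast instead of retrying

    def record_success(self):
        with self._lock:
            self._tokens = min(self._max, self._tokens + self._refill)

budget = SharedRetryBudget()

def call_with_shared_budget(fn):
    try:
        result = fn()
        budget.record_success()
        return result
    except Exception:
        if budget.try_acquire():
            return fn()  # a single, budgeted retry
        raise           # everyone else fails fast
```

When the endpoint is genuinely down, the pool empties within seconds and thousands of sessions fail fast instead of multiplying the load.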
## Things to follow up on...
- Recovery takes four phases: Even after AWS fixed the DNS misconfiguration during the October 2025 outage, application-level recovery required additional hours as systems cleared backlogs, reset circuit breakers, and exhausted retry logic.
- Circuit breakers need tuning: If not monitored frequently, circuit breakers can reduce throughput and cause prolonged downtime even after downstream services recover, creating their own coordination problems.
- AI identifies cascade patterns: Recent SRE implementations show that machine learning can correlate related incidents based on service topology and transaction spikes, but tuning requires month-long pilot phases where engineers label false positives.
- Race conditions at scale: The AWS outage root cause was a race condition between two DNS Enactor processes where timing caused an outdated plan to be applied and immediately deleted, removing all IP addresses for the DynamoDB endpoint.

