The monitoring alert fires at 2am. Workflow crashed at step four of seven. The on-call engineer pulls up the logs: memory spike, container restart, session gone. Should those first three successful steps have been saved somewhere?
Teams running high-frequency, short-duration workflows restart failed runs from the beginning. No checkpoints, no saved state, no coordination infrastructure. Launch a new browser, re-execute all seven steps, complete before anything else breaks.
The alternative costs more than accepting occasional restarts. Implementing checkpoints, coordinating state across workers, managing cleanup for thousands of sessions: the engineering overhead exceeds the cost of re-execution. Production math drives the decision.
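In code, the whole strategy fits in a retry loop. A minimal sketch, where `run_workflow`, the step list, and the attempt cap are illustrative stand-ins rather than any particular team's harness:

```python
import time

MAX_ATTEMPTS = 3  # illustrative cap, not a recommendation

def run_workflow(steps):
    """Hypothetical stand-in: runs every step in a fresh browser session,
    raising on any failure. Partial progress is deliberately not saved."""
    for step in steps:
        step()

def run_with_restarts(steps):
    # No checkpoints: any failure throws away all progress and the whole
    # workflow re-executes from step one in a brand-new session.
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return run_workflow(steps)
        except Exception as exc:
            print(f"attempt {attempt} failed ({exc!r}); restarting from step one")
            time.sleep(2 ** attempt)  # brief backoff before relaunching
    raise RuntimeError("workflow failed after all restart attempts")
```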
Memory Pressure at 10,000 Concurrent Sessions
Memory pressure doesn't hit uniformly. Workflows 1-150 complete fine. Workflow 151 triggers garbage collection. That pause causes workflow 152 to time out. Two failures cascade into four, then eight. The system stable at 150 concurrent sessions falls apart at 160, and the failure pattern isn't obvious from any single workflow's logs.
At scale, container restarts don't just lose individual sessions—they trigger cascading failures that multiply across the fleet.
Container restarts delete sessions mid-workflow. Workers crash and need replacement. Network hiccups that seem momentary cause complete state loss. A workflow progressing normally crashes because its container ran out of shared memory.
Session is gone. Cookies, authentication tokens, form data—all in memory, all disappeared. The workflow restarts from step one and re-executes everything.
How Teams Manage 5,000 Daily Workflows
Consider a team running 5,000 workflows daily with a three-minute average duration: 4% fail before completion. That's 200 restarts and 600 minutes of re-execution. The checkpoint infrastructure they'd need (coordinating state across twelve workers, managing cleanup for 5,000 sessions daily, handling synchronization delays at each step) costs more in engineering time and operational overhead than accepting those 600 minutes of restarts.
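The shape of that comparison in numbers. The re-execution side comes straight from the paragraph above; the checkpoint side (a synchronization delay at every step, paid by every workflow, failed or not) is an illustrative assumption:

```python
# Back-of-envelope math. Re-execution numbers are from the text;
# the checkpoint-side figures are illustrative assumptions.
daily_workflows = 5_000
failure_rate = 0.04
avg_minutes = 3

restarts = daily_workflows * failure_rate          # 200 restarts/day
re_execution_minutes = restarts * avg_minutes      # 600 minutes/day

# Hypothetical checkpoint cost: a sync delay at each of seven steps,
# paid by every workflow whether or not it fails.
sync_seconds_per_step = 1.5
steps = 7
checkpoint_minutes = daily_workflows * steps * sync_seconds_per_step / 60

print(re_execution_minutes, checkpoint_minutes)    # 600.0 875.0
```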
Teams running high-frequency workflows focus on keeping sessions alive long enough to complete without hitting resource constraints:
- Monitor memory usage closely
- Set strict session timeouts
- Maintain browser pools so zombie sessions don't accumulate until file descriptor limits are hit (see the sketch after this list)
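The last two practices can share one harness. A minimal sketch assuming Playwright's async API, with memory monitoring left to the platform; the concurrency cap and timeout values echo the failure numbers above and are illustrative:

```python
import asyncio
from playwright.async_api import async_playwright

MAX_CONCURRENT = 150      # cap below the cliff the fleet hits around 160
SESSION_TIMEOUT_S = 180   # strict ceiling: a stuck session dies, not lingers

async def run_one(browser, semaphore, workflow):
    # The semaphore enforces the concurrency cap; the timeout kills stuck
    # sessions before they pile up against file descriptor limits.
    async with semaphore:
        context = await browser.new_context()
        try:
            page = await context.new_page()
            await asyncio.wait_for(workflow(page), timeout=SESSION_TIMEOUT_S)
        finally:
            await context.close()  # always release the session's resources

async def run_all(workflows):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        try:
            await asyncio.gather(
                *(run_one(browser, semaphore, w) for w in workflows),
                return_exceptions=True,  # one failure must not cancel the fleet
            )
        finally:
            await browser.close()
```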
A three-step workflow that completes in ninety seconds doesn't need elaborate persistence. If it fails, restarting costs ninety seconds. For teams running controlled environments with sufficient memory and stable networks, completing 95% of workflows on the first attempt means re-running the failed 5% still costs less than adding coordination overhead to all 100%.
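The arithmetic behind that claim, treating each failure as exactly one clean re-run (ignoring the rare double failure):

```python
run_s = 90
success_rate = 0.95

# Expected per-run cost with restart-from-scratch.
expected_s = success_rate * run_s + (1 - success_rate) * (2 * run_s)  # 94.5

# Checkpointing only wins if its per-run overhead stays under this margin.
breakeven_overhead_s = expected_s - run_s  # 4.5 seconds per run
```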
A Fifteen-Minute Workflow That Fails at Step Six
The workflow fails at step six and thirteen minutes of progress disappears. When it restarts, it begins from scratch: no authentication, no collected data, no memory of the five successful steps that already completed.
At a 15% failure rate, you're spending more time re-executing successful steps than you would spend on checkpoint coordination. Authentication overhead compounds: a workflow that authenticates fresh on every run spends forty seconds on login alone. Multiply that across 10,000 daily runs and you're burning 111 hours just logging in.
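One cheap move short of full checkpoints is persisting only the authentication state, so reruns skip the forty-second login. A sketch assuming Playwright's sync API and its storage_state mechanism; the login helper and file path are hypothetical:

```python
from pathlib import Path

STATE_FILE = Path("auth_state.json")  # illustrative path

def new_authenticated_context(browser, login):
    # Reuse saved cookies and tokens when they exist: pay the ~40-second
    # login once, then amortize it across every subsequent run.
    if STATE_FILE.exists():
        return browser.new_context(storage_state=str(STATE_FILE))
    context = browser.new_context()
    login(context.new_page())                    # hypothetical login routine
    context.storage_state(path=str(STATE_FILE))  # persist cookies + tokens to disk
    return context
```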
The operational constraint: workflows that must pause between steps. A compliance check that authenticates, scrapes forty pages of transaction history, then waits for manual review before continuing. That manual review takes two to six hours. You can't keep a browser running that long: memory leaks accumulate, sessions time out, containers get recycled.
The workflow that worked fine at fifteen minutes breaks completely at three hours. Teams implement checkpoints when the operational constraint forces it.
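The checkpoint itself doesn't have to be elaborate. A minimal sketch, assuming Playwright context and browser objects and an illustrative file path: it saves the session state plus the data collected so far, so a fresh container can pick up after the review window:

```python
import json
from pathlib import Path

CHECKPOINT = Path("compliance_run.json")  # illustrative path

def checkpoint(context, scraped_rows):
    """Persist session state plus collected data, then let the browser die."""
    CHECKPOINT.write_text(json.dumps({
        "storage_state": context.storage_state(),  # cookies, tokens, local storage
        "scraped_rows": scraped_rows,
    }))

def resume(browser):
    """Hours later, in a fresh container: rehydrate instead of re-executing."""
    saved = json.loads(CHECKPOINT.read_text())
    context = browser.new_context(storage_state=saved["storage_state"])
    return context, saved["scraped_rows"]
```

The caveat: storage_state captures client-side cookies and local storage, so resuming hours later only works if the target site's server-side session hasn't expired in the meantime.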

