The workflow crashes at step six of seven. Thirteen minutes of progress disappears. Authentication that succeeded ten minutes ago is gone. Form data collected in step two vanished. The workflow restarts from the beginning and re-executes everything.
For teams running long workflows across distributed fleets, this pattern becomes unsustainable. The operational cost of re-executing successful steps exceeds the cost of implementing persistence. What does checkpoint coordination actually require in production?
Compliance Workflows That Can't Keep Browsers Running
Compliance workflows that require hours of manual review between steps can't keep browsers running—checkpoints become an operational necessity.
Authenticate to a compliance portal, scrape transaction history across forty pages, export to CSV, then wait for manual legal review before continuing to the next section. That manual review takes three hours on average, sometimes six.
You can't keep a browser running that long. Memory footprint grows unbounded during long-running sessions—reloading a page frees some allocated memory but not all. Session cookies expire. Containers get recycled on schedule. The workflow needs to terminate after authentication and resume hours later with state intact.
After each workflow step completes, serialize the browser's state to disk. Cookies, localStorage, session storage, authentication tokens—everything needed to resume from that point. The browser terminates. When the next step needs to execute, a new browser instance launches, loads the saved state, continues from the checkpoint.
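What this looks like depends on the automation stack. As a minimal sketch, assuming Playwright for Python and a hypothetical portal URL, saving and restoring storage state might look like this:

```python
from playwright.sync_api import sync_playwright

CHECKPOINT = "step-1-authenticated.json"  # hypothetical checkpoint file

# Step 1: authenticate, serialize state, terminate the browser.
with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://portal.example.com/login")  # hypothetical portal
    # ... perform the login flow here ...
    context.storage_state(path=CHECKPOINT)  # writes cookies + localStorage to disk
    browser.close()

# Hours later, possibly on a different worker: resume from the checkpoint.
with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context(storage_state=CHECKPOINT)
    page = context.new_page()
    page.goto("https://portal.example.com/transactions")  # continues as authenticated
    browser.close()
```

Note that Playwright's storage state covers cookies and localStorage; session storage and anything held only in page memory needs separate handling.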
Coordinating State Across Distributed Workers
State saved on worker A needs to be accessible when the workflow resumes on worker B. User data directories live on each worker's local file system and do not sync across a distributed fleet. Teams must either track which worker holds which session and route resumed workflows back to it, or implement shared storage that every worker can access.
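One shared-storage approach is to push each checkpoint to an object store keyed by workflow ID, so any worker can pick up the resume. A rough sketch with boto3; the bucket name and key scheme are assumptions:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "workflow-checkpoints"  # assumed shared bucket


def push_checkpoint(workflow_id: str, step: int, local_path: str) -> str:
    """Upload a checkpoint so any worker in the fleet can resume from it."""
    key = f"{workflow_id}/step-{step}.json"
    s3.upload_file(local_path, BUCKET, key)
    return key


def pull_checkpoint(workflow_id: str, step: int, local_path: str) -> str:
    """Download the checkpoint on whichever worker picks up the resumed workflow."""
    key = f"{workflow_id}/step-{step}.json"
    s3.download_file(BUCKET, key, local_path)
    return local_path
```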
Checkpoint data also accumulates. Jobs crash, get retried mid-process, or are abandoned due to external interruptions, and the sessions they leave behind stay active until their TTL expires. Teams implement periodic cleanup routines that find expired or unused sessions and remove them; relying on TTL expiration alone holds resources longer than necessary, especially when workflows crash mid-run without cleaning up after themselves.
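A cleanup routine doesn't have to be elaborate. Sticking with the object-store layout sketched above, a scheduled job that deletes checkpoints older than a cutoff covers most abandoned-session cases; the retention window is an assumption:

```python
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "workflow-checkpoints"   # assumed shared bucket
RETENTION = timedelta(hours=24)   # assumed retention window


def cleanup_stale_checkpoints() -> int:
    """Delete checkpoints older than the retention window instead of waiting on TTL."""
    cutoff = datetime.now(timezone.utc) - RETENTION
    deleted = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            if obj["LastModified"] < cutoff:
                s3.delete_object(Bucket=BUCKET, Key=obj["Key"])
                deleted += 1
    return deleted
```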
Session cookies without explicit expiration times don't persist across browser restarts, even with checkpoint mechanisms in place. Workflows that relied on session cookies for authentication start failing after checkpoints are added. Workarounds involve different serialization strategies, each with trade-offs around what state persists and what gets lost.
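One such strategy is to stamp an explicit expiration onto session cookies before serializing, so they survive the restore. A sketch assuming Playwright; the TTL is an assumption and should match the portal's real session lifetime:

```python
import time


def persist_session_cookies(context, ttl_seconds: int = 4 * 3600) -> None:
    """Give session cookies an explicit expiry so they survive a checkpoint restore."""
    cookies = context.cookies()
    for cookie in cookies:
        if cookie.get("expires", -1) < 0:  # -1 marks a session cookie
            cookie["expires"] = time.time() + ttl_seconds
    context.clear_cookies()
    context.add_cookies(cookies)
    # Then serialize as usual, e.g. context.storage_state(path=...).
```

The trade-off is that the portal may still invalidate the session server-side, regardless of what the cookie claims.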
Checkpointed Workflows Across Distributed Fleets
After saving state, there's a brief delay before the updated checkpoint is ready for use. Teams must pause before reusing the same session to ensure data is properly synchronized. This adds latency that compounds across multiple checkpoints.
For a seven-step workflow with checkpoints after each step, that synchronization delay adds up. The workflow that completed in fifteen minutes without checkpoints now takes eighteen minutes with coordination overhead. But when the workflow crashes at step six, it resumes from the checkpoint rather than starting over. The thirteen minutes of progress that would have disappeared are preserved.
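Rather than a fixed pause, the resuming worker can poll until the checkpoint is actually readable. A rough sketch against the shared bucket from the earlier example; the timeout and interval are assumptions:

```python
import time

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")


def wait_for_checkpoint(bucket: str, key: str, timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Poll shared storage until the checkpoint is visible, or give up at the timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            s3.head_object(Bucket=bucket, Key=key)
            return True
        except ClientError:
            time.sleep(interval)
    return False
```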
To make crashes survivable, teams take on the coordination work:
- Track session locations across workers
- Implement cleanup routines for expired state
- Handle synchronization delays
- Structure workflows to checkpoint at appropriate boundaries—after expensive operations complete, before long-running steps begin (see the sketch after this list)
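A minimal shape for that structure, assuming a Playwright BrowserContext and hypothetical per-step functions supplied by the caller:

```python
import json
import os
from typing import Callable, Sequence


def run_workflow(workflow_id: str, context, steps: Sequence[Callable]) -> None:
    """Run steps in order, checkpointing browser state and progress after each one.

    On restart, previously completed steps are skipped instead of re-executed.
    """
    progress_path = f"{workflow_id}.progress.json"  # hypothetical progress record
    done = 0
    if os.path.exists(progress_path):
        with open(progress_path) as f:
            done = json.load(f)["completed_steps"]

    for index, step in enumerate(steps):
        if index < done:
            continue  # completed before the crash; skip re-execution
        step(context)
        # Checkpoint after the expensive work, before the next long-running step.
        context.storage_state(path=f"{workflow_id}.step-{index}.json")
        with open(progress_path, "w") as f:
            json.dump({"completed_steps": index + 1}, f)
```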
When Coordination Overhead Makes Sense
Teams running 3,000 workflows daily with a fifteen-minute average duration see a 12% failure rate. That's 360 failures every day.
| Approach | Daily Re-execution Cost | Calculation |
|---|---|---|
| Without Checkpoints | 5,400 minutes | 360 failures × 15 min re-executed each |
| With Checkpoints | 1,080 minutes | 360 failures × 3 min to resume from the last checkpoint |
| Net Savings | 4,320 minutes daily | 5,400 − 1,080 minutes of re-executed work |
With checkpoints, workflows resume from the last successful step. A failure at step six means re-executing only step six, not steps one through five. The coordination overhead—synchronization delays, cleanup routines, session tracking—adds roughly three minutes per workflow, about 9,000 minutes of added wall-clock time daily across the fleet.
But the 360 failures now cost only 1,080 minutes of re-executed work to recover (three minutes per failure to resume from the last checkpoint) instead of 5,400 minutes to restart from scratch. The operational savings: 4,320 minutes of re-executed work avoided daily, weighed against the coordination latency added to every workflow.
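A back-of-the-envelope restatement of those figures:

```python
workflows_per_day = 3_000
failure_rate = 0.12
workflow_minutes = 15      # full run
resume_minutes = 3         # re-execute only the failed step
overhead_minutes = 3       # per-workflow coordination latency

failures = int(workflows_per_day * failure_rate)        # 360
rerun_cost = failures * workflow_minutes                # 5,400 min re-executed daily
recovery_cost = failures * resume_minutes               # 1,080 min re-executed daily
work_avoided = rerun_cost - recovery_cost               # 4,320 min no longer redone
added_latency = workflows_per_day * overhead_minutes    # 9,000 min of added wall-clock time

print(failures, rerun_cost, recovery_cost, work_avoided, added_latency)
```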
Teams make this decision based on workflow length, failure rates, environment stability, and the cost of re-executing successful steps. The operational reality determines which pattern survives in production.

