When we're running web agents that monitor hotel inventory across thousands of properties—each with different authentication flows, regional variations, and rate limits—one question shapes everything else: how long does this workflow need to remember what it's doing?
That question drives infrastructure complexity, recovery mechanisms, and what workflows you can run reliably. State persistence exists on a spectrum, and understanding where your workflows fall tells you what you actually need to build.
How the Framework Emerged
Operating web agents at scale taught us that "stateless versus stateful" oversimplifies the problem. What matters is matching durability guarantees to workflow behavior.
A quick availability check that completes in seconds? If it fails halfway through, just restart. The workflow is fast enough that preserving intermediate state costs more than re-running it. But a multi-step verification flow that authenticates, navigates regional booking systems, extracts structured data, and validates against business rules? That workflow carries context across steps, potentially spans hours, and losing progress means wasted compute and delayed results.
The durability spectrum runs from fully stateless execution (no memory between attempts) through session-based state (temporary context during active work) and checkpointing (save progress at milestones) to comprehensive state management (full execution context survives anything).
Where your workflows fall tells you what infrastructure you need.
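As a rough illustration, the spectrum can be written down as an explicit decision rule. This is a minimal Python sketch; the names (`Durability`, `required_durability`) and the thresholds are assumptions for illustration, not part of any framework.

```python
from enum import Enum

class Durability(Enum):
    STATELESS = "stateless"          # no memory between attempts; just re-run
    SESSION = "session"              # temporary context while actively working
    CHECKPOINT = "checkpoint"        # progress saved at milestones, resumable
    COMPREHENSIVE = "comprehensive"  # full execution context survives anything

def required_durability(expected_runtime_s: float,
                        carries_context_across_steps: bool,
                        state_loss_has_business_impact: bool) -> Durability:
    """Rough decision rule: match durability guarantees to workflow behavior."""
    if state_loss_has_business_impact:
        return Durability.COMPREHENSIVE
    if expected_runtime_s > 3600:        # hours-long, cross-system coordination
        return Durability.CHECKPOINT
    if carries_context_across_steps:     # minutes-long, needs auth/navigation context
        return Durability.SESSION
    return Durability.STATELESS          # seconds-long, cheaper to re-run than to persist
```

A 20-second availability check maps to STATELESS; an hour-long cross-region extraction maps to CHECKPOINT. The value is making the choice explicit rather than inheriting it from whatever infrastructure happens to be available.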
Applying the Framework
Ask yourself: what happens if this workflow gets interrupted?
Workflows measured in seconds can restart from scratch. Annoying, but acceptable. The infrastructure stays simple: no state storage, no recovery mechanisms. Microsoft's Agent Framework makes this explicit: agents are stateless and don't maintain state internally between calls.
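At this end of the spectrum, recovery really is just re-running the work. A minimal sketch, not tied to any particular framework (the `run_stateless` helper and its parameters are illustrative assumptions):

```python
import time

def run_stateless(check, max_attempts: int = 3, backoff_s: float = 2.0):
    """Run a seconds-long check; on failure, restart it from scratch.

    No intermediate state is kept between attempts -- re-running is the
    entire recovery mechanism.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return check()                   # e.g. a single availability lookup
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_s * attempt)  # simple backoff before retrying from zero
```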
Workflows measured in minutes that need context across steps require session-based state. The agent remembers login credentials and navigation state while actively working. If the session expires during a rate limit pause or when authentication tokens time out, you start over. For workflows that take 10-15 minutes, this is manageable. For workflows coordinating across multiple regional systems over an hour, it's a reliability problem.
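A sketch of what session-based state can look like, assuming a hypothetical `AgentSession` that holds credentials and navigation context in memory only:

```python
import time
from dataclasses import dataclass, field

@dataclass
class AgentSession:
    """Context held only while the agent is actively working.

    If the process dies or the token expires, everything here is gone
    and the workflow restarts from the first step.
    """
    auth_token: str
    expires_at: float                        # epoch seconds when the token times out
    navigation_state: dict = field(default_factory=dict)

    def is_alive(self) -> bool:
        return time.time() < self.expires_at

def run_with_session(steps, session: AgentSession):
    for step in steps:
        if not session.is_alive():
            # Expiry mid-run (rate-limit pause, token timeout) loses all progress:
            # nothing was saved, so the caller must start over.
            raise RuntimeError("session expired; restart the workflow")
        step(session)
```

For a 10-minute flow, the occasional restart is tolerable; for an hour of cross-regional coordination, that `RuntimeError` is the reliability problem described above.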
Workflows measured in hours that coordinate across systems need checkpointing. When we're extracting data across thousands of properties with authentication labyrinths and regional variations, checkpointing means pauses for rate limits or transient failures don't force restarts from the beginning. The workflow saves progress at milestones and resumes from there.
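A minimal checkpointing sketch, assuming a hypothetical local JSON checkpoint file and a caller-supplied `extract_one` function. A production system would persist to a durable store, but the resume logic has the same shape:

```python
import json
from pathlib import Path

CHECKPOINT = Path("extraction_checkpoint.json")    # illustrative local checkpoint

def load_checkpoint() -> set[str]:
    """Return the property IDs already processed by a previous (interrupted) run."""
    return set(json.loads(CHECKPOINT.read_text())) if CHECKPOINT.exists() else set()

def save_checkpoint(done: set[str]) -> None:
    CHECKPOINT.write_text(json.dumps(sorted(done)))

def extract_all(property_ids: list[str], extract_one) -> None:
    done = load_checkpoint()
    for i, prop in enumerate(property_ids):
        if prop in done:
            continue                    # resume: skip work finished before the interruption
        extract_one(prop)               # may raise on rate limits or transient failures
        done.add(prop)
        if (i + 1) % 100 == 0:          # milestone: persist progress every 100 properties
            save_checkpoint(done)
    save_checkpoint(done)
```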
When state loss means business impact, as with multi-agent coordination, operations spanning days, or scenarios where reliability can only come from infrastructure, comprehensive state management becomes necessary: the full execution context must survive any failure.
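One way to picture this level is an append-only journal of every step's inputs and outputs, so a run can be reconstructed after any failure rather than merely resumed from the last milestone. The sketch below uses SQLite purely for illustration; a real deployment would use a durable, distributed store:

```python
import json
import sqlite3
import time

class ExecutionJournal:
    """Append-only record of every step a workflow (or set of agents) takes.

    With the full execution context on durable storage, a run spanning days
    can be replayed and resumed after any failure.
    """
    def __init__(self, db_path: str = "workflow_journal.db"):   # illustrative store
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS journal "
            "(workflow_id TEXT, step TEXT, payload TEXT, ts REAL)"
        )

    def record(self, workflow_id: str, step: str, payload: dict) -> None:
        self.conn.execute(
            "INSERT INTO journal VALUES (?, ?, ?, ?)",
            (workflow_id, step, json.dumps(payload), time.time()),
        )
        self.conn.commit()

    def replay(self, workflow_id: str) -> list[tuple[str, dict]]:
        rows = self.conn.execute(
            "SELECT step, payload FROM journal WHERE workflow_id = ? ORDER BY ts",
            (workflow_id,),
        )
        return [(step, json.loads(payload)) for step, payload in rows]
```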
What Teams Misjudge
The teams we see struggle most aren't the ones who choose stateless when they need durability. That breaks obviously during testing.
It's the teams who build comprehensive state management for workflows that don't need it, then discover they've committed to infrastructure complexity that slows everything down.
A single-shot price check doesn't need the same durability guarantees as a multi-day verification workflow coordinating across systems. Without this framework, teams either over-engineer simple workflows or under-engineer complex ones. Dapr's distinction between ephemeral agents and durable agents illustrates this fork: synchronous interaction versus asynchronous, autonomous execution with persistent state.
Map your workflow patterns to durability requirements before architectural decisions become commitments. The answer shapes infrastructure complexity, recovery mechanisms, and what workflows you can reliably run at scale. More importantly, it shows you what you can avoid building.
Things to follow up on...
- Multi-agent coordination patterns: When workflows require multiple agents working together, distributed state stores optimized for multi-agent access patterns become essential for maintaining session-level consistency across concurrent operations.
- Human-in-the-loop durability requirements: Agents that rely on human feedback for approval or clarification need resumable state and support for arbitrarily long delays between steps, fundamentally different from fully automated workflows.
- Recovery time objectives matter: While frameworks discuss recovery mechanisms, understanding specific recovery time objectives and recovery point objectives helps teams evaluate whether infrastructure can handle months-long processes with guaranteed resumption.
- Observable pattern for production: Successful production implementations follow the observable pattern, where every action and decision is logged, queryable, and traceable; that visibility is essential for understanding how agents actually operate at scale (a minimal sketch follows this list).
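A minimal sketch of that observable pattern, assuming a hypothetical `log_decision` helper that emits one structured JSON record per agent action:

```python
import json
import logging
import uuid

# One JSON object per agent action or decision, tagged with a trace ID so every
# record from a single workflow run can be queried and correlated later.
logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("agent.audit")

def log_decision(trace_id: str, agent: str, action: str, reason: str, **details) -> None:
    logger.info(json.dumps({
        "trace_id": trace_id,
        "agent": agent,
        "action": action,
        "reason": reason,
        **details,
    }))

# Usage: every navigation, extraction, or retry decision emits one record.
trace_id = str(uuid.uuid4())
log_decision(trace_id, "inventory-checker", "retry", "rate limit hit", wait_s=30)
```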

