An agent pauses mid-workflow, waiting for someone to approve a sensitive transaction. The operator steps away for a meeting. Twenty minutes later, they return to click "approve"—and discover the agent has no memory of ever making the request.
The workflow is gone. Fifteen steps of reasoning, data gathering, and validation have vanished. Not because something crashed. Because the infrastructure assumed the agent would either finish or fail cleanly, not suspend itself mid-thought and wait.
Pausing breaks conventional infrastructure assumptions in ways that aren't obvious until production. Operators discover this when workflows vanish during what should be routine human oversight.
The Infrastructure Assumption That Breaks
Traditional automation runs continuously—start to finish, log results, exit. Agents assume interruption. They pause for human oversight, wait for external services, or suspend when information is missing. During these pauses, they need to maintain their entire cognitive state: conversation history, tool outputs, intermediate reasoning, the exact execution context.
Systems designed for continuous execution don't handle cognitive states that suspend mid-stream and expect to resume as if no time has passed. Saving data solves only part of the problem.
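As a rough illustration, the state that has to survive a pause might be modeled like this. The field names are hypothetical, not any particular framework's schema:

```python
from dataclasses import dataclass, field, asdict
import json
import time

@dataclass
class SuspendedAgentState:
    """Everything the agent needs to resume as if no time has passed."""
    thread_id: str      # persistent cursor identifying this workflow
    messages: list      # full conversation history
    tool_outputs: list  # results already gathered from external tools
    scratchpad: str     # intermediate reasoning in progress
    next_node: str      # the exact point where execution resumes
    suspended_at: float = field(default_factory=time.time)

    def serialize(self) -> str:
        # The snapshot must survive process restarts and redeployments,
        # so everything here has to be JSON-serializable.
        return json.dumps(asdict(self))
```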
What Production Reveals
Operating web agents across thousands of sites shows why interruption must be the default assumption. An agent monitoring hotel inventory doesn't fail because the model can't reason—it pauses because a site deployed new authentication, or rate limits kicked in, or pricing data became ambiguous. Keeping the agent's entire reasoning context alive while the world changes around it becomes the core infrastructure challenge.
Session infrastructure adds its own failure mode: sessions automatically terminate after 15 minutes of inactivity, and the timeout surfaces only when workflows start failing at consistent points. The infrastructure assumes "inactive" means "done," not "waiting."
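One common workaround is a keepalive loop that resets the idle clock at intervals well under the limit. This sketch assumes a hypothetical `session.ping()` no-op; the real call depends on the platform:

```python
import threading

SESSION_TIMEOUT_S = 15 * 60               # platform kills sessions idle this long
PING_INTERVAL_S = SESSION_TIMEOUT_S // 3  # stay comfortably under the limit

def keep_alive(session, stop: threading.Event) -> None:
    """Signal 'waiting' rather than 'done' until told to stop."""
    while not stop.wait(PING_INTERVAL_S):
        session.ping()  # hypothetical no-op that resets the idle timer

# Usage: start before blocking on a human, stop once approval arrives.
# stop = threading.Event()
# threading.Thread(target=keep_alive, args=(session, stop), daemon=True).start()
# approval = wait_for_human()   # may take twenty minutes or more
# stop.set()
```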
Keeping sessions alive only buys time. The deeper response is to checkpoint everything: every workflow step writes a snapshot, not just data but the graph state, config, metadata, and the next nodes to execute. When a process restarts or a session times out, the system queries the last checkpoint and resumes. Continuous state persistence treats interruption as the default.
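A minimal version of the pattern, independent of any particular framework (the SQLite schema and function names here are illustrative):

```python
import json
import sqlite3

db = sqlite3.connect("checkpoints.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS checkpoints "
    "(thread_id TEXT, step INTEGER, snapshot TEXT)"
)

def save_checkpoint(thread_id: str, step: int, state: dict) -> None:
    """Write a full snapshot after every step: state, config, next node."""
    db.execute(
        "INSERT INTO checkpoints (thread_id, step, snapshot) VALUES (?, ?, ?)",
        (thread_id, step, json.dumps(state)),
    )
    db.commit()

def load_latest(thread_id: str) -> dict | None:
    """After a restart or timeout, resume from the last snapshot, not step one."""
    row = db.execute(
        "SELECT snapshot FROM checkpoints "
        "WHERE thread_id = ? ORDER BY step DESC LIMIT 1",
        (thread_id,),
    ).fetchone()
    return json.loads(row[0]) if row else None
```

The essential property is that every step writes before moving on, so the most recent checkpoint is never more than one step stale.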
An agent monitoring competitive pricing pauses for manual review of an ambiguous discount structure. During the wait, the team deploys a code update that changes how pricing data is structured. When the operator approves, the agent attempts to resume, but the saved snapshot of its reasoning no longer matches the shape the new code expects. The workflow fails silently; operators discover it only hours later, when the data pipeline shows gaps.
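One defensive pattern is to stamp every snapshot with a schema version and refuse to resume mismatches, failing loudly at resume time instead of silently downstream. The version numbers, migration, and error type below are all hypothetical:

```python
SCHEMA_VERSION = 3  # bump whenever deployed code changes the state shape

class IncompatibleSnapshotError(RuntimeError):
    pass

def _v2_to_v3(snapshot: dict) -> dict:
    # Hypothetical migration: pricing moved from a flat field to a nested dict.
    snapshot["pricing"] = {"amount": snapshot.pop("price", None)}
    snapshot["schema_version"] = 3
    return snapshot

MIGRATIONS = {(2, 3): _v2_to_v3}

def resume(snapshot: dict) -> dict:
    """Never silently resume a snapshot written by incompatible code."""
    saved = snapshot.get("schema_version", 0)
    if saved == SCHEMA_VERSION:
        return snapshot
    migration = MIGRATIONS.get((saved, SCHEMA_VERSION))
    if migration is None:
        raise IncompatibleSnapshotError(
            f"snapshot v{saved} cannot resume under code expecting v{SCHEMA_VERSION}"
        )
    return migration(snapshot)
```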
Experienced operators develop understanding that isn't written down anywhere:
- Thread IDs function as persistent cursors: they're not just identifiers but the mechanism for loading saved state and resuming execution
- Activities must be idempotent, producing the same result every time they run with the same input. When the system retries after a failure, running the same step twice must not create duplicate orders or corrupted data (see the sketch after this list)
- Ping handlers signal "healthy busy" status, not because the agent is doing work, but because the infrastructure needs proof it's still alive while waiting
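For the idempotency point above, the usual mechanism is to gate each activity on a unique key recorded durably before side effects commit. A minimal sketch, assuming retries arrive on a fresh connection after a restart (the schema and helper are illustrative):

```python
import sqlite3

def execute_once(db: sqlite3.Connection, idempotency_key: str, action) -> None:
    """Run an activity's side effect at most once across retries."""
    db.execute("CREATE TABLE IF NOT EXISTS completed (key TEXT PRIMARY KEY)")
    try:
        # Claim the key first; a retry of an already-committed step
        # hits the PRIMARY KEY constraint and is skipped.
        db.execute("INSERT INTO completed (key) VALUES (?)", (idempotency_key,))
    except sqlite3.IntegrityError:
        return   # this step already completed: no duplicate order
    action()     # e.g. place the order
    db.commit()  # commit only after success, so a failed attempt can retry
```

Thread IDs from the first bullet make natural prefixes for these keys: `f"{thread_id}:{step}"` uniquely names each attempt.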
Why Infrastructure Must Assume Interruption
The gap between pause and resume is where conventional infrastructure breaks. When agents need human oversight on ambiguous data, or wait for rate limits to reset, or suspend during authentication challenges—these moments define the operational reality that infrastructure must be designed around.
Building web agent infrastructure that runs reliably at scale requires different assumptions. Systems must treat suspended cognitive state as a first-class concern. Where demos can assume continuous execution, production systems must assume interruption—and keep the agent's reasoning alive through server restarts, deployments, and time.
Things to follow up on...
- Context window explosion: As agents run longer, the amount of information they need to track grows faster than context windows can absorb; the problem can't be solved simply by giving models larger context windows.
- State serialization compatibility: When approvals take a long time and teams version agent definitions or bump SDK versions, serialized state may no longer be compatible across changes, forcing parallel SDK installations or custom branching logic.
- The production deployment gap: Only 2% of organizations have deployed agentic AI at scale while 61% remain stuck in exploration, and Gartner predicts over 40% of projects will be canceled by the end of 2027 due to escalating costs and inadequate risk controls.
- Observability for non-deterministic systems: When multi-agent systems produce unexpected results, tools like LangSmith trace agent decision-making in detail, filling the role debuggers play for traditional programs.

