The agent demos look impressive. Natural language commands, autonomous execution, minimal human intervention. Then you try to deploy in production and discover that "autonomous" means something very different when real money and real customers are involved.
Cognition's Devin resolves 14% of GitHub issues autonomously: twice as good as chatbots, but nowhere near what the "autonomous software engineer" framing suggests. Organizations that treat autonomy as a capability maturity ladder (start supervised, gradually increase independence, eventually reach full autonomy) discover their production deployments stall at levels far below what the technology theoretically supports.
Technical capability rarely explains this gap. What matters is consequence mapping.
What Conventional Frameworks Miss
Most autonomy frameworks focus on model confidence scores, accuracy metrics, or capability levels. How good is the agent? But at scale, technical capability is rarely the constraint. AWS Q Dev autonomously calls hundreds of APIs to diagnose and fix resource issues. Shopify's Sidekick never makes changes to merchant shops without approval. Same underlying technology, radically different autonomy choices. Both correct for their contexts.
What determines autonomy boundaries? What happens when the agent gets it wrong.
Organizations operating agents at scale have converged on a similar evaluation approach. Before deciding how much autonomy to grant a workflow, they map three dimensions.
Reversibility measures whether an action can be undone with a click. Generating a draft document is reversible. Publishing content to customers isn't. Creating a database backup is reversible. Deleting a table isn't.
Blast radius tracks what gets affected if the agent gets this wrong. A single user's preference setting has minimal blast radius. A pricing change visible to all customers has maximum blast radius.
Detection speed captures how quickly someone will notice if this goes wrong. Errors in internal reports might go undetected for days. Errors in customer-facing responses get flagged immediately.
High reversibility, small blast radius, fast detection? Let the agent execute autonomously. Low reversibility, large blast radius, slow detection? Require human approval.
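As a rough illustration, those three dimensions can be collapsed into a single routing decision. The sketch below is an assumption-laden example, not any vendor's implementation: the enum, the dataclass, and the numeric thresholds are all made up for the illustration.

```python
# Minimal sketch of a consequence-mapping check. The names and thresholds
# are illustrative assumptions, not a real platform's API.
from enum import Enum
from dataclasses import dataclass


class AutonomyLevel(Enum):
    AUTONOMOUS = "autonomous"        # agent executes without review
    HUMAN_REVIEW = "human_review"    # agent acts, a human reviews the output
    APPROVAL_REQUIRED = "approval"   # a human must approve before execution


@dataclass
class ConsequenceProfile:
    reversible: bool        # can the action be undone with a click?
    blast_radius: int       # rough count of users/records affected if wrong
    detection_hours: float  # how long before someone notices an error


def autonomy_for(profile: ConsequenceProfile) -> AutonomyLevel:
    """Map a workflow's consequence profile to an autonomy level."""
    low_risk = (
        profile.reversible
        and profile.blast_radius <= 1          # single user or record
        and profile.detection_hours <= 1.0     # errors surface fast
    )
    high_risk = (
        not profile.reversible
        or profile.blast_radius > 1_000        # customer-wide impact
        or profile.detection_hours > 24.0      # errors linger unnoticed
    )
    if low_risk:
        return AutonomyLevel.AUTONOMOUS
    if high_risk:
        return AutonomyLevel.APPROVAL_REQUIRED
    return AutonomyLevel.HUMAN_REVIEW


# Example: drafting an internal summary vs. publishing a pricing change.
draft = ConsequenceProfile(reversible=True, blast_radius=1, detection_hours=0.5)
pricing = ConsequenceProfile(reversible=False, blast_radius=50_000, detection_hours=48)
assert autonomy_for(draft) is AutonomyLevel.AUTONOMOUS
assert autonomy_for(pricing) is AutonomyLevel.APPROVAL_REQUIRED
```

In practice the thresholds would come from the organization's own risk tolerance rather than fixed constants, but the shape of the decision is the same.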
Klarna's AI assistant handles two-thirds of customer service conversations autonomously but escalates the remaining third to humans. Not because those queries are technically harder, but because they involve ambiguity in contexts where relationship damage is difficult to repair. The consequence mapping determined the boundary.
Why "Gradual Autonomy Increase" Fails at Scale
Organizations that try to increase autonomy uniformly discover their highest-risk workflows become the constraint. You can't run everything at the autonomy level your most sensitive operation requires without eliminating the efficiency gains that made agents valuable.
Shopify reports that systems become harder to reason about as tool counts increase from 20 to 50+. Not because the agents are less capable, but because more tools mean more potential consequence combinations, making blast radius harder to predict. The complexity isn't in the agent. It's in mapping what could go wrong.
Autonomy decisions aren't about the agent's capabilities at all. They're about the organization's tolerance for specific types of consequences in specific contexts. Consequences don't converge over time. Read-only information gathering will always have a different risk profile than destructive actions.
Evaluating Autonomy for New Workflows
When evaluating autonomy for a new workflow, the questions shift. Not "how smart is our agent?" or "what maturity level should we be at?" Instead:
- What happens if this goes wrong?
- Can we undo it?
- Who's affected?
- How quickly will we know?
Those answers determine whether the workflow runs autonomously, runs with human review, or requires explicit approval. The staging isn't temporal (moving from supervised to autonomous over time). It's contextual: different autonomy levels for different consequence profiles, maintained permanently.
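One way to make that contextual staging concrete is a static policy keyed by workflow rather than a global maturity level. The workflow names, modes, and dispatcher below are hypothetical, assuming the caller supplies the execute, approval, and review hooks.

```python
# Hypothetical per-workflow autonomy policy: boundaries follow each
# workflow's consequence profile, not a maturity level that ratchets up.
AUTONOMY_POLICY = {
    "summarize_ticket":       "autonomous",         # reversible, internal, caught fast
    "draft_customer_reply":   "human_review",       # reversible, but customer-facing
    "issue_refund":           "approval_required",  # irreversible, real money
    "change_catalog_pricing": "approval_required",  # large blast radius, slow detection
}


def dispatch(workflow: str, execute, request_approval, queue_for_review):
    """Route a workflow through the autonomy boundary set for it."""
    mode = AUTONOMY_POLICY.get(workflow, "approval_required")  # default to safest
    if mode == "autonomous":
        return execute()
    if mode == "human_review":
        result = execute()
        queue_for_review(workflow, result)
        return result
    return request_approval(workflow)
```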
Organizations should evaluate agent platforms differently. The relevant questions aren't about model benchmarks or capability demonstrations. They're about observability (can you see what the agent is doing?), auditability (can you trace decisions after the fact?), and rollback mechanisms (can you undo what went wrong?). The infrastructure for managing consequences matters more than the intelligence for executing tasks.
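A minimal sketch of what that consequence-managing infrastructure might look like, assuming every tool call can be captured as a record with an optional undo callable. The class and field names are illustrative, not a real platform's API.

```python
# Sketch of the consequence-management side of an agent platform:
# every action is logged with enough context to audit it later and,
# where possible, a callable that undoes it. Names are assumptions.
import datetime
import uuid
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ActionRecord:
    workflow: str
    tool: str
    arguments: dict
    result_summary: str
    undo: Optional[Callable[[], None]] = None   # None means irreversible
    action_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: datetime.datetime = field(
        default_factory=lambda: datetime.datetime.now(datetime.timezone.utc)
    )


class AuditLog:
    def __init__(self):
        self._records: list[ActionRecord] = []

    def record(self, entry: ActionRecord) -> None:
        # Observability: the entry is visible as soon as the action runs.
        self._records.append(entry)

    def trace(self, workflow: str) -> list[ActionRecord]:
        # Auditability: reconstruct what the agent did, after the fact.
        return [r for r in self._records if r.workflow == workflow]

    def rollback(self, action_id: str) -> bool:
        # Rollback: undo a specific action if an undo handle was captured.
        for r in self._records:
            if r.action_id == action_id and r.undo is not None:
                r.undo()
                return True
        return False
```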
Map consequences first. Set autonomy accordingly. Then maintain those boundaries as the system scales.
Things to follow up on...
- Just-in-Time instructions approach: Shopify's breakthrough involved returning relevant instructions alongside tool data exactly when needed rather than cramming all guidance into system prompts, which helped maintain system comprehensibility as tool counts scaled from 20 to 50+.
- Fine-grained authorization requirements: Organizations implementing agent systems need database access controls that set scope at the customer level and can easily check if functions or data are available within specific scopes, making fine-grained permissions a likely requirement for working with agents.
- Cost control guardrails: Beyond action reversibility, organizations implement manual approval requirements for actions exceeding defined cost thresholds and use fast, cheap models to detect malicious usage before expensive models run, preventing both inappropriate use and budget overruns (see the sketch after this list).
- Evaluation mechanics evolution: OpenAI's expanded Evals approach mirrors what enterprises are building in-house with canonical task suites, golden answers, and regression checks across toolchains, paired with incident logging that tags prompt injections, permission denials, and overrides for continuous quality monitoring.
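A hedged sketch of the cost-guardrail idea from the note above, assuming the caller supplies the screening, cost-estimation, execution, and approval hooks. The threshold and function names are hypothetical.

```python
# Illustrative cost guardrail: a cheap screening step runs before the
# expensive model, and anything over a spend threshold is held for
# manual approval. Threshold and hook names are assumptions.
COST_APPROVAL_THRESHOLD_USD = 5.00


def run_with_guardrails(task, estimate_cost, cheap_screen, expensive_run, request_approval):
    """Gate an agent task on a cheap misuse check and estimated cost."""
    # A fast, cheap model flags obviously malicious or out-of-scope requests
    # before any expensive model is invoked.
    if not cheap_screen(task):
        return {"status": "rejected", "reason": "failed screening"}

    estimated = estimate_cost(task)
    if estimated > COST_APPROVAL_THRESHOLD_USD:
        # Above the threshold, a human signs off before the spend happens.
        if not request_approval(task, estimated):
            return {"status": "held", "reason": f"cost ${estimated:.2f} needs approval"}

    return {"status": "completed", "result": expensive_run(task)}
```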

