
Foundations
Conceptual clarity earned from building at scale

Consequence-Bounded Autonomy

Been thinking about how every agent conversation I have circles back to capability questions—model accuracy, task completion rates, benchmark scores. But the teams actually running these systems at scale? They're asking completely different questions. Mainly: what breaks if this screws up?
What's interesting is how autonomy decisions map to consequence tolerance, not technical readiness. You see the same pattern across different deployments: the framework isn't a maturity ladder to climb, it's about matching autonomy to the specific mess you're willing to handle when things go sideways. The forecast matters less than knowing what storm you're prepared for.
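To make that concrete, here's a minimal sketch of what a consequence-bounded gate could look like in code. The tier names, the MAX_UNATTENDED threshold, and route_action are illustrative assumptions, not anything from a real deployment:

    # Illustrative sketch: gate agent autonomy on the consequence of the action,
    # not on how capable the model is. Tiers and threshold are made up.
    from enum import IntEnum

    class Consequence(IntEnum):
        REVERSIBLE = 1      # easy to undo: draft an email, open a ticket
        COSTLY = 2          # annoying to undo: refund, config change
        IRREVERSIBLE = 3    # can't undo: delete data, move money, email a customer

    # The only tunable: how much mess you're willing to clean up unattended.
    MAX_UNATTENDED = Consequence.REVERSIBLE

    def route_action(action_name: str, consequence: Consequence) -> str:
        """Decide whether the agent acts alone or a human signs off."""
        if consequence <= MAX_UNATTENDED:
            return f"auto-execute: {action_name}"
        return f"queue for human approval: {action_name}"

    print(route_action("draft_reply", Consequence.REVERSIBLE))   # auto-execute
    print(route_action("issue_refund", Consequence.COSTLY))      # human approval

The point of the sketch is that the threshold, not the model, is what changes between deployments.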

When Shopping Carts Became Identity

Been thinking about why authentication breaks so predictably at scale. Years of watching production login flows, and there's this pattern where the complexity isn't in the code—it's in managing state that lives in two places at once. Browser thinks you're logged in, server disagrees, user sees an error that looks like a bug. Turns out a 1994 shopping cart decision is still dictating how that works.
The forecast here: browsers are deprecating third-party cookies, which means enterprises are rebuilding authentication infrastructure while keeping millions of sessions alive. Not a clean cutover. Both systems running simultaneously, hoping nothing breaks mid-flight. Classic case of infrastructure built for one purpose getting stretched way past its design constraints.
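Here's a toy sketch of that two-places problem, assuming an in-memory session store and a session_id cookie; the names and handler shape are placeholders, not any framework's real API:

    # Toy sketch of session state living in two places at once.
    import secrets

    SERVER_SESSIONS: dict[str, str] = {}   # session_id -> user_id (server's view)

    def login(user_id: str) -> dict:
        """Server creates a session and tells the browser to remember the id."""
        session_id = secrets.token_hex(16)
        SERVER_SESSIONS[session_id] = user_id
        return {"session_id": session_id}   # becomes the browser's view (a cookie)

    def handle_request(cookies: dict) -> str:
        """The failure mode: browser still holds a cookie the server forgot."""
        session_id = cookies.get("session_id")
        if session_id is None:
            return "401: no session, show login"
        user_id = SERVER_SESSIONS.get(session_id)
        if user_id is None:
            # Browser thinks it's logged in, server disagrees. Without this
            # explicit branch, the user sees a generic error that looks like a bug.
            return "401: session expired, clear cookie and re-authenticate"
        return f"200: hello {user_id}"

    cookie = login("ada")
    print(handle_request(cookie))   # 200: hello ada
    SERVER_SESSIONS.clear()         # e.g. server restart or session purge
    print(handle_request(cookie))   # 401: session expired ...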

A Conversation with /graphql, The Endpoint That Swallowed the API

Pattern Recognition from the Field
Every major AI lab just announced a "reasoning" model. OpenAI's o1. Anthropic's extended thinking. Google's Gemini thinking mode. The pitch is identical: these models think longer, reason deeper, solve harder problems.
Then you actually use them.
OpenAI claimed o1 performs at PhD level on physics and math. Users found it failing basic logic puzzles within days. Anthropic shows Claude's thinking process, which often reveals circular reasoning dressed up as deliberation. Google's announcement was heavy on promise, light on specifics.
The pattern is clear. Labs are conflating longer inference time with better reasoning. Spending more tokens to think doesn't automatically produce better outcomes. It's like assuming someone who talks longer is automatically smarter.
This matters because enterprises are making architecture decisions based on these claims. They're building systems that assume reasoning capabilities that don't exist at scale yet.
AI labs are racing to announce reasoning models, but the gap between marketing claims and actual shipped capabilities keeps widening across OpenAI, Anthropic, and Google.
OpenAI's o1 fails basic logic despite PhD-level claims. Claude's extended thinking often reasons in circles. Google's details remain vague despite bold announcements.
Companies are making expensive architectural decisions based on reasoning capabilities that don't exist at scale yet, creating technical debt before systems even launch.
Longer inference time doesn't equal better reasoning. Tokens spent thinking are a cost, not automatically a quality improvement for production systems.
Test reasoning models on your actual use cases, not benchmark claims. Measure output quality and operational cost, not how long the model appears to think.
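One way to act on that last point: a minimal evaluation sketch where call_model, the cases, and the per-token price are placeholders you'd swap for your own client, tasks, and contract rate.

    # Sketch of evaluating a "reasoning" model on your own cases rather than
    # benchmark claims. call_model, CASES, and the price are placeholders.
    from dataclasses import dataclass

    @dataclass
    class Result:
        answer: str
        tokens_in: int
        tokens_out: int   # includes any "thinking" tokens you're billed for

    def call_model(prompt: str) -> Result:
        """Stub so the sketch runs end to end; swap in your actual client call."""
        return Result(answer="(model output here)", tokens_in=50, tokens_out=900)

    CASES = [  # your real tasks, each with a check you trust
        ("Summarize ticket #4812 in two sentences", lambda ans: len(ans) < 400),
    ]
    PRICE_PER_1K_OUT = 0.06  # assumed; substitute your actual rate

    def evaluate() -> None:
        passed, cost = 0, 0.0
        for prompt, check in CASES:
            r = call_model(prompt)
            passed += check(r.answer)
            cost += r.tokens_out / 1000 * PRICE_PER_1K_OUT
        print(f"pass rate: {passed}/{len(CASES)}, est. output cost: ${cost:.2f}")

    evaluate()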
Questions Worth Asking
