Foundations
Conceptual clarity earned from building at scale

Three Learning Modes That Separate Successful Agent Teams From Struggling Ones

When Harvard researchers tracked 758 consultants working with AI, they discovered a puzzle: teams using identical technology saw quality jump 40 percent on some tasks and performance drop 19 percentage points on others. The technology wasn't the variable.
We've watched this pattern repeat across enterprises deploying web agents. Same infrastructure, same technical capability. Some teams get dramatically better at knowing which sites can run autonomously, which need oversight. Others stay stuck, deploying agents but never learning from what happens in production. The difference isn't sophistication—it's whether teams have built the conditions that make learning possible in the first place.

Tools & Techniques

Schema Validation Catches Breaks Immediately—At a Maintenance Cost
Schema validation gives you something rare in web scraping: immediate certainty when extraction breaks. Define your rules once and catch violations instantly. A required field disappears, or a price arrives as text instead of a number, and you know within seconds. But every site change means someone is updating validation rules, often at 2 a.m. when a page breaks. The clarity about exactly what failed comes with constant maintenance work, and at scale that maintenance becomes real operational overhead.
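
A minimal sketch of the idea in Python, using the jsonschema library; the product schema and field names here are illustrative, not a prescribed standard:

    # A hypothetical product schema: required fields and types are explicit,
    # so a violation surfaces the moment a record is extracted.
    from jsonschema import ValidationError, validate

    PRODUCT_SCHEMA = {
        "type": "object",
        "required": ["title", "price", "url"],
        "properties": {
            "title": {"type": "string", "minLength": 1},
            "price": {"type": "number", "minimum": 0},  # catches "price became text"
            "url": {"type": "string"},
        },
    }

    def check_record(record: dict) -> str | None:
        """Return None if the record is valid, else a description of what broke."""
        try:
            validate(instance=record, schema=PRODUCT_SCHEMA)
            return None
        except ValidationError as err:
            return f"{list(err.absolute_path)}: {err.message}"

    # A price extracted as a string fails immediately, with the exact field named.
    print(check_record({"title": "Widget", "price": "$19.99", "url": "https://x.test/w"}))

The flip side is visible in the schema itself: every rule it encodes is a rule someone has to update when the site changes.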

Statistical Validation Adapts to Scale—But Misses Sudden Breaks
Build baselines from actual extraction history, let them adapt as sites evolve—statistical validation handles the web's constant changes without manual rule updates. A/B tests and gradual shifts don't trigger false alarms because the system learns what normal variation looks like. But when a site completely redesigns overnight, statistical methods take days to flag the problem. The adaptation that makes this approach scale effortlessly also means missing what schemas catch instantly: sudden structural failures.
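
A minimal sketch of the baseline idea, assuming you log one quality metric per extraction run; the metric, window size, and thresholds are all illustrative:

    # Learn "normal" from recent runs, alert only on sustained deviation.
    from collections import deque
    from statistics import mean, stdev

    class DriftDetector:
        def __init__(self, window: int = 30, sigmas: float = 3.0, confirm: int = 3):
            self.history = deque(maxlen=window)  # rolling baseline of recent runs
            self.sigmas = sigmas                 # deviation needed to look anomalous
            self.confirm = confirm               # consecutive anomalies before alerting
            self._streak = 0

        def observe(self, missing_price_rate: float) -> bool:
            """Feed one run's metric; return True once deviation is sustained."""
            if len(self.history) >= 10:  # need enough history for a baseline
                mu, sigma = mean(self.history), stdev(self.history)
                if sigma > 0 and abs(missing_price_rate - mu) > self.sigmas * sigma:
                    self._streak += 1
                else:
                    self._streak = 0
            # Folding every observation back in lets the baseline absorb gradual
            # shifts (A/B tests) without alarms. It is also why a hard break needs
            # several runs before the detector is confident it's real.
            self.history.append(missing_price_rate)
            return self._streak >= self.confirm

The confirmation window is the tradeoff in miniature: it suppresses false alarms from normal variation, and it is exactly what delays the alert when a site breaks overnight.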

An Interview With the Fundamental Problem Every Standards Body Faces
Pattern Recognition

Late 2024 brought reasoning models. Everyone expected better math. What actually happened: reliable tool calling at scale.
Coding agents need to sustain hundreds of tool invocations across expanding context windows without breaking down. SWE-Bench scores jumped from Devin's 13.86% in early 2024 to 80%+ by 2025. The architecture that emerged: reasoning models plan workflows, cheaper models execute tasks. Training against verifiable rewards taught models to decompose problems into steps, and that capability consumed compute originally meant for pretraining. Most of 2025's progress came from longer RL runs, not bigger base models.
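
A minimal sketch of that two-tier pattern; call_llm and the model names are hypothetical stand-ins for whatever completion API and models you actually use:

    import json

    def call_llm(model: str, prompt: str) -> str:
        """Placeholder for a real chat-completion call."""
        raise NotImplementedError

    def run_task(task: str) -> list[str]:
        # 1. The expensive reasoning model decomposes the task once.
        plan = json.loads(call_llm(
            model="reasoning-large",
            prompt=f"Break this task into a JSON list of concrete steps: {task}",
        ))
        # 2. A cheaper model executes each step, carrying prior results forward.
        results: list[str] = []
        for step in plan:
            results.append(call_llm(
                model="executor-small",
                prompt=f"Step: {step}\nPrior results: {results}",
            ))
        return results

The split matters economically: planning happens once per task at reasoning-model prices, while the hundreds of per-step calls run on the cheap model.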

