Foundations
Conceptual clarity earned from building at scale

Three Learning Modes That Separate Successful Agent Teams From Struggling Ones

When Harvard researchers tracked 758 consultants working with AI, they discovered a puzzle: teams using identical technology saw quality jump 40 percent on some tasks and fall 19 percentage points on others. The technology wasn't the variable.
We've watched this pattern repeat across enterprises deploying web agents. Same infrastructure, same technical capability. Some teams get dramatically better at knowing which sites can run autonomously, which need oversight. Others stay stuck, deploying agents but never learning from what happens in production. The difference isn't sophistication—it's whether teams have built the conditions that make learning possible in the first place.


An Interview With the Fundamental Problem Every Standards Body Faces
Pattern Recognition
Late 2024 brought reasoning models. Everyone expected better math. What actually happened: reliable tool calling at scale.
Coding agents need to sustain hundreds of tool invocations across expanding context windows without breaking down. SWE-Bench scores jumped from Devin's 13.86% in early 2024 to 80%+ by 2025. The architecture that emerged: reasoning models plan workflows, cheaper models execute tasks. Training against verifiable rewards taught models to decompose problems into steps. That capability consumed compute originally meant for pretraining. Most 2025 progress came from longer RL runs, not bigger base models.
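The planner/executor split described above can be sketched in a few lines. This is a hedged illustration, not any vendor's actual API: `plan` and `execute` stand in for calls to a reasoning model and a cheaper model respectively, and the task-decomposition logic is a deliberately naive placeholder.

```python
# Minimal sketch of the planner/executor architecture: a reasoning model
# decomposes the task into steps once, then a cheaper model executes each
# step. Both model calls below are hypothetical stubs, not real APIs.

def plan(task: str) -> list[str]:
    """Hypothetical planner call (reasoning model): decompose task into steps."""
    # Naive stand-in for real decomposition: split on commas.
    return [f"step {i}: {part.strip()}" for i, part in enumerate(task.split(","), 1)]

def execute(step: str) -> str:
    """Hypothetical executor call (cheaper model): carry out one concrete step."""
    return f"done: {step}"

def run(task: str) -> list[str]:
    # The expensive planner runs once; the cheap executor runs per step,
    # which is where the hundreds of tool invocations accumulate.
    return [execute(step) for step in plan(task)]

results = run("fetch issue, write patch, run tests")
# → ['done: step 1: fetch issue', 'done: step 2: write patch', 'done: step 3: run tests']
```

The economic point of the design is visible even in the stub: the costly reasoning call is amortized across many cheap execution calls.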



