Foundations
Conceptual clarity earned from building at scale

Three Learning Modes That Separate Successful Agent Teams From Struggling Ones

When Harvard researchers tracked 758 consultants working with AI, they discovered a puzzle: teams using identical technology saw quality jump 40 percent on some tasks and performance drop 19 percentage points on others. The technology wasn't the variable.
We've watched this pattern repeat across enterprises deploying web agents. Same infrastructure, same technical capability. Some teams get dramatically better at knowing which sites can run autonomously, which need oversight. Others stay stuck, deploying agents but never learning from what happens in production. The difference isn't sophistication—it's whether teams have built the conditions that make learning possible in the first place.

Tools & Techniques

Schema Validation Catches Breaks Immediately—At a Maintenance Cost
Schema validation gives you something rare in web scraping: immediate certainty when extraction breaks. Define your rules once and catch violations instantly. A required field disappears, or a price arrives as text instead of a number, and you know within seconds. But every site change means someone is updating validation rules, often at 2 a.m. when a page breaks. The clarity about exactly what failed comes with constant maintenance work, and at scale that maintenance becomes real operational overhead.
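
A minimal sketch of the idea in Python, using the jsonschema library; the product schema and field names here are illustrative, not a prescribed standard:

    # A hypothetical product schema: required fields and types are explicit,
    # so a violation surfaces the moment a record is extracted.
    from jsonschema import ValidationError, validate

    PRODUCT_SCHEMA = {
        "type": "object",
        "required": ["title", "price", "url"],
        "properties": {
            "title": {"type": "string", "minLength": 1},
            "price": {"type": "number", "minimum": 0},  # catches "price became text"
            "url": {"type": "string"},
        },
    }

    def check_record(record: dict) -> str | None:
        """Return None if the record is valid, else a description of what broke."""
        try:
            validate(instance=record, schema=PRODUCT_SCHEMA)
            return None
        except ValidationError as err:
            return f"{list(err.absolute_path)}: {err.message}"

    # A price extracted as a string fails immediately, with the exact field named.
    print(check_record({"title": "Widget", "price": "$19.99", "url": "https://x.test/w"}))

The flip side is visible in the schema itself: every rule it encodes is a rule someone has to update when the site changes.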

Statistical Validation Adapts to Scale—But Misses Sudden Breaks
Build baselines from actual extraction history, let them adapt as sites evolve—statistical validation handles the web's constant changes without manual rule updates. A/B tests and gradual shifts don't trigger false alarms because the system learns what normal variation looks like. But when a site completely redesigns overnight, statistical methods take days to flag the problem. The adaptation that makes this approach scale effortlessly also means missing what schemas catch instantly: sudden structural failures.
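
A minimal sketch of the baseline idea, assuming you log one quality metric per extraction run; the metric, window size, and thresholds are all illustrative:

    # Learn "normal" from recent runs, alert only on sustained deviation.
    from collections import deque
    from statistics import mean, stdev

    class DriftDetector:
        def __init__(self, window: int = 30, sigmas: float = 3.0, confirm: int = 3):
            self.history = deque(maxlen=window)  # rolling baseline of recent runs
            self.sigmas = sigmas                 # deviation needed to look anomalous
            self.confirm = confirm               # consecutive anomalies before alerting
            self._streak = 0

        def observe(self, missing_price_rate: float) -> bool:
            """Feed one run's metric; return True once deviation is sustained."""
            if len(self.history) >= 10:  # need enough history for a baseline
                mu, sigma = mean(self.history), stdev(self.history)
                if sigma > 0 and abs(missing_price_rate - mu) > self.sigmas * sigma:
                    self._streak += 1
                else:
                    self._streak = 0
            # Folding every observation back in lets the baseline absorb gradual
            # shifts (A/B tests) without alarms. It is also why a hard break needs
            # several runs before the detector is confident it's real.
            self.history.append(missing_price_rate)
            return self._streak >= self.confirm

The confirmation window is the tradeoff in miniature: it suppresses false alarms from normal variation, and it is exactly what delays the alert when a site breaks overnight.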

An Interview With the Fundamental Problem Every Standards Body Faces
Pattern Recognition

Late 2024 brought reasoning models. Everyone expected better math. What actually happened: reliable tool calling at scale.
Coding agents need to sustain hundreds of tool invocations across expanding context windows without breaking down. SWE-Bench scores jumped from Devin's 13.86% in early 2024 to 80%+ by 2025. The architecture that emerged: reasoning models plan workflows, cheaper models execute tasks. Training against verifiable rewards taught models to decompose problems into steps, and that capability consumed compute originally meant for pretraining. Most of 2025's progress came from longer RL runs, not bigger base models.
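
A minimal sketch of that two-tier pattern; call_llm and the model names are hypothetical stand-ins for whatever completion API and models you actually use:

    import json

    def call_llm(model: str, prompt: str) -> str:
        """Placeholder for a real chat-completion call."""
        raise NotImplementedError

    def run_task(task: str) -> list[str]:
        # 1. The expensive reasoning model decomposes the task once.
        plan = json.loads(call_llm(
            model="reasoning-large",
            prompt=f"Break this task into a JSON list of concrete steps: {task}",
        ))
        # 2. A cheaper model executes each step, carrying prior results forward.
        results: list[str] = []
        for step in plan:
            results.append(call_llm(
                model="executor-small",
                prompt=f"Step: {step}\nPrior results: {results}",
            ))
        return results

The split matters economically: planning happens once per task at reasoning-model prices, while the hundreds of per-step calls run on the cheap model.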

