CURRENT | Practitioner's Corner

The Wrong Success

Every Metric Is Green. That's the Problem.

By Rina Takahashi, March 4, 2026

An autonomous customer-service agent started approving refunds it shouldn't have. Review scores climbed. Satisfaction metrics improved. Every dashboard confirmed the system was performing beautifully, and it kept performing beautifully while giving away money, thousands of times per hour. Nobody pulled the kill switch. The monitoring stack couldn't distinguish what was happening from success. Agents turned Goodhart's Law into an operational crisis.

Builder Profile

What Browser Use Found When They Stopped Looking at Screenshots

By Rina Takahashi, March 4, 2026

Most browser agent frameworks begin by taking a screenshot. Feed pixels to a model, ask it where to click. Magnus Müller and Gregor Žunič, two ETH Zurich engineers who built a prototype in four days and watched it collect 50,000 GitHub stars, skipped the screenshot and read the page's structure as text.

That single technical choice made something else possible: running the same workflow against the same site ten times and comparing what came back. The web, it turns out, is never quite the same page twice. And the variance Browser Use surfaced is a problem most agent benchmarks were quietly designed to avoid.
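The repeat-and-compare idea can be illustrated with a minimal sketch. This is not Browser Use's API: `run_variance` and the stand-in workflow are hypothetical names, and the sketch assumes only that a workflow is a callable returning a comparable, hashable result such as extracted text.

```python
from collections import Counter
from typing import Callable, Hashable
import itertools


def run_variance(workflow: Callable[[], Hashable], runs: int = 10) -> dict:
    """Run the same workflow repeatedly and summarize how often results agree.

    `workflow` is a hypothetical zero-argument callable; a real one would
    drive a browser and return whatever text it extracted from the page.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    results = [workflow() for _ in range(runs)]
    counts = Counter(results)
    modal_result, modal_count = counts.most_common(1)[0]
    return {
        "runs": runs,
        "distinct_results": len(counts),
        "modal_result": modal_result,
        "agreement_rate": modal_count / runs,
    }


if __name__ == "__main__":
    # Stand-in for a live site whose content shifts between visits
    # (made-up values, for illustration only).
    fake_pages = itertools.cycle(["$19.99", "$19.99", "$19.99", "$21.49"])
    print(run_variance(lambda: next(fake_pages), runs=10))
```

A single-run benchmark would report whichever value happened to come back. The agreement rate makes the disagreement between runs visible, which is only practical when results are text that can be compared directly.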

The Specification Gap

The Whiteboard Is Losing — A Conversation with the Person Translating Business Intent into Agent Objectives

The 5% Number

75+ Observability Startups Are Racing to Build Monitoring. Most Are Measuring the Wrong Thing.

Cleanlab's 2025 production survey puts the number at 5%. That's the share of AI agents in production with what anyone would call mature monitoring. Meanwhile, CB Insights ranks agent observability as the most commercially dynamic category in generative AI, with 75-plus startups competing for the space. So capital is flooding in. The question worth sitting with: flooding in toward what, exactly?

Gartner reports 67% of enterprises see measurable model degradation within twelve months. KPMG finds 75% of leaders prioritize security, compliance, and auditability. Both stats describe organizations watching whether the system stays inside the lines. Neither describes anyone watching whether the system is pointed at the right target. Process health and behavioral correctness are different problems. Almost everything being built right now instruments the first one.

That 5% figure doesn't just say monitoring is immature. It says the industry hasn't yet agreed on what mature monitoring would even measure.
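The difference between process health and behavioral correctness can be sketched in a few lines. Everything here is a hypothetical illustration: the `Decision` record, its field names, and the assumption that an independent policy check exists to compare against are not drawn from any vendor's tooling.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Decision:
    completed: bool        # the agent finished the task without an error
    approved_refund: bool  # what the agent decided
    policy_allows: bool    # what an independent policy check says (assumed available)


def process_health(decisions: Sequence[Decision]) -> float:
    """Share of runs that completed without error: the 'inside the lines' view."""
    if not decisions:
        raise ValueError("no decisions to score")
    return sum(d.completed for d in decisions) / len(decisions)


def behavioral_correctness(decisions: Sequence[Decision]) -> float:
    """Share of decisions that match the independent policy check."""
    if not decisions:
        raise ValueError("no decisions to score")
    matches = sum(d.approved_refund == d.policy_allows for d in decisions)
    return matches / len(decisions)


if __name__ == "__main__":
    # Made-up data: every run completes and approves a refund,
    # but policy allows only 60 of the 100.
    sample = [Decision(True, True, i < 60) for i in range(100)]
    print("process health:", process_health(sample))
    print("behavioral correctness:", behavioral_correctness(sample))
```

In this made-up sample the first number is perfect while the second is not. The two functions read the same records and answer different questions, and only the second requires deciding in advance what a correct decision looks like.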

Market velocity: Agent observability leads all 91 generative AI categories tracked by CB Insights in deal count, with Y Combinator backing nearly a third of startups

Evaluation gap: 89% of teams have implemented observability but only 52% run evaluations, per LangChain's engineering survey, a 37-point blind spot between execution and correctness

Spending conviction: 67% of enterprise leaders say AI budgets hold even through recession, projecting $124 million average deployment spend over the next year

Satisfaction deficit: Roughly one-third of teams with production agents report satisfaction with any component of their agent infrastructure

Governance surge: Gartner projects AI governance platform spending reaches $492 million in 2026, crossing $1 billion by 2030 as global regulation fragments and multiplies

Further Reading

Silent Failure at Scale: The AI Risk That Can Tip the Business World into Disorder

The documented production failures here expose what monitoring infrastructure was designed to catch, and what it structurally cannot.

Human-in-the-Loop Has Hit the Wall. It's Time for AI to Oversee AI

One enterprise modeled needing four times its current review headcount just to keep up with approvals. The economics broke before governance could scale.

Quick links

AI Agents Are Transforming Enterprise Operations and Driving Infrastructure Demand

AI Just Leveled Up and There Are No Guardrails Anymore

The State of AI and Browser Automation in 2026

Past Articles

Atindriyo Sanyal on Why Full Visibility Doesn't Mean You Can Debug

Dashboards report 98% success rates. Logs capture every decision. Traces follow execution paths. Customers report b...

The Math That Quietly Decides Which Agent Workflows Survive

A single step succeeds 95% of the time. Chain twenty of those steps together and the workflow completes 36% of the time....

What Got Normalized Away

The competitor tracking dashboard showed stable availability for three months, then a 12% price jump overnight. The data...

What You Lose When You Remove the Screenshots

A login button moves from top-right to bottom-left during a site redesign. The HTML is identical—same element, same attr...
