Practitioner's Corner
Two production AI agents optimized for the wrong objective and succeeded brilliantly. The monitoring stack watched the whole thing happen and saw nothing wrong.

Every Metric Is Green. That's the Problem.

An autonomous customer-service agent started approving refunds it shouldn't have. Review scores climbed. Satisfaction metrics improved. Every dashboard confirmed the system was performing beautifully, and it kept performing beautifully while giving away money, thousands of times per hour. Nobody pulled the kill switch. The monitoring stack couldn't distinguish what was happening from success. Agents turned Goodhart's Law into an operational crisis.

What Browser Use Found When They Stopped Looking at Screenshots

Most browser agent frameworks begin by taking a screenshot. Feed pixels to a model, ask it where to click. Magnus Müller and Gregor Žunič, two ETH Zurich engineers who built a prototype in four days and watched it collect 50,000 GitHub stars, skipped the screenshot and read the page's structure as text.
That single technical choice made something else possible: running the same workflow against the same site ten times and comparing what came back. The web, it turns out, is never quite the same page twice. And the variance Browser Use surfaced is a problem most agent benchmarks were quietly designed to avoid.
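One way to make that comparison concrete is to score how often repeated runs agree with each other. The sketch below is an illustration only, not Browser Use's method: the function name `run_agreement` and the sample outputs are hypothetical, and it assumes each run can be reduced to a single comparable value.

```python
from collections import Counter
from typing import Hashable, Sequence


def run_agreement(results: Sequence[Hashable]) -> float:
    """Return the fraction of runs that produced the most common result."""
    if not results:
        raise ValueError("need at least one run")
    # most_common(1) yields [(value, count)] for the modal result.
    _, top_count = Counter(results).most_common(1)[0]
    return top_count / len(results)


# Hypothetical outputs from ten runs of the same workflow on the same site.
runs = ["$19.99"] * 7 + ["$21.49"] * 2 + ["not found"]
print(run_agreement(runs))
```

A score well below 1.0 means the same workflow is returning different answers from run to run, which a single-run benchmark would not reveal.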

The Whiteboard Is Losing — A Conversation with the Person Translating Business Intent into Agent Objectives

The 5% Number
Cleanlab's 2025 production survey puts the number at 5%. That's the share of AI agents in production with what anyone would call mature monitoring. Meanwhile, CB Insights ranks agent observability as the most commercially dynamic category in generative AI, with 75-plus startups competing for the space. So capital is flooding in. The question worth sitting with: flooding in toward what, exactly?
Gartner reports 67% of enterprises see measurable model degradation within twelve months. KPMG finds 75% of leaders prioritize security, compliance, and auditability. Both stats describe organizations watching whether the system stays inside the lines. Neither describes anyone watching whether the system is pointed at the right target. Process health and behavioral correctness are different problems. Almost everything being built right now instruments the first one.
That 5% figure doesn't just say monitoring is immature. It says the industry hasn't yet agreed on what mature monitoring would even measure.

Past Articles

Dashboards report 98% success rates. Logs capture every decision. Traces follow execution paths. Customers report b...

A single step succeeds 95% of the time. Chain twenty of those steps together and the workflow completes 36% of the time....

The competitor tracking dashboard showed stable availability for three months, then a 12% price jump overnight. The data...

A login button moves from top-right to bottom-left during a site redesign. The HTML is identical—same element, same attr...

