Market Pulse
When agents decide, our measurement frameworks stop making sense.

The $12 Billion Counterfactual

Amazon says its AI shopping assistant drove $12 billion in sales that wouldn't have happened without it. Conversion rates up 60%. Independent researchers, meanwhile, found the system recommends the actual best product 32% of the time. Both numbers appear to be true.
Those figures sit comfortably in the same earnings disclosure, which is worth pausing on. Nobody's cooking the books. The gap between the numbers just points somewhere unexpected. We know how to measure what a button click produces. Measuring what an autonomous system causes turns out to be a different kind of problem, and the $12 billion figure makes that vivid without quite resolving it.

Research Foundations
When Optimization Pressure Overrides Safety Constraints
Existing safety tests check refusal of explicit harm, not emergent violations during performance optimization.
Superior reasoning doesn't prevent constraint violations when agents optimize for outcomes.
Research Foundations
Chaotic Systems Break Counterfactual Measurement Tools
Sophisticated methods produce confident answers that spiral into unreliability under chaos.
Hypothetical scenario analysis fails when decision boundaries are chaotic, potentially worsening bias.
Research Foundations
Outcome Metrics That Transcend Task Completion
Task-specific tests and system metrics don't reveal decision quality or tangible value.
Measure decision quality, autonomy degree, and business impact regardless of architecture.
Research Foundations
Task Duration as Stable Capability Indicator
Task length captures progression in a way comparable across domains.
Trend steepness means predictions remain reliable even with large measurement uncertainties.
Measurement in Practice
Amazon says Rufus drove $12 billion in "incremental" sales. That word carries weight. It means purchases that wouldn't have happened otherwise, a counterfactual no one can observe directly. OpenAI claims Frontier returns "90% more time" to client teams without explaining what counts as time or what the baseline was. Gartner's 40% enterprise adoption figure? Explicitly labeled a best-case scenario, not a forecast.
When market projections span $52 billion to $236 billion by 2030-2034, the variance isn't just methodological. It's definitional chaos about what qualifies as an agent versus what's automation with better marketing.
Past Articles

Five different payment protocols for AI agents launched between September and December 2025. Google, Visa, Mastercar...

MuleSoft launched Agent Scanners in January 2026 to help enterprises answer what should be straightforward: what ag...

Databricks just paid $1 billion for Neon, a database company most enterprises haven't heard of. Buried in the annou...

Salesforce is hiring Forward Deployed Engineers at a pace that saw job postings surge 800% in nine months. OpenAI's Fron...

