Amazon's Q4 2025 earnings included a number worth sitting with: Rufus, the company's AI shopping assistant, generated nearly $12 billion in incremental annualized sales during 2025. Not $12 billion in sales it touched. Incremental sales. Purchases that, in Amazon's framing, "customers might not have made without Rufus assistance."
That's a measurement of something that didn't happen. A $12 billion claim about absence.
Amazon builds this figure through what it calls "downstream impact," using a seven-day attribution window to track purchases following Rufus interactions. The supporting evidence looks reasonable on its face: customers who engage with Rufus are 60% more likely to purchase. Over 300 million customers used it last year. But then you set those numbers next to independent research showing Rufus recommendations are 83% self-serving, favoring Amazon's own products, and match the actual "best product" only 32% of the time. The $12 billion is presented as value delivered to customers, but the system is measurably optimizing for Amazon's catalog. Value for whom becomes part of the counting problem.
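Amazon hasn't published the mechanics of its "downstream impact" methodology, but window-based attribution generally works the same way everywhere: credit any purchase that lands within the window after an interaction. A minimal sketch (function and figures are illustrative, not Amazon's):

```python
from datetime import datetime, timedelta

# Toy window-based attribution: credit any purchase that lands
# within `window` after an assistant interaction. Note what this
# actually measures: temporal proximity, not causation.
def attributed_purchases(interactions, purchases, window=timedelta(days=7)):
    total = 0.0
    for p_time, amount in purchases:
        if any(timedelta(0) <= (p_time - i_time) <= window for i_time in interactions):
            total += amount
    return total

# A purchase two days after a chat is credited; one two weeks later is not.
sales = attributed_purchases(
    interactions=[datetime(2025, 1, 1)],
    purchases=[(datetime(2025, 1, 3), 50.0), (datetime(2025, 1, 15), 30.0)],
)
print(sales)  # 50.0
```

Everything the rest of this piece worries about hides inside that `any(...)` condition: the rule credits every purchase that follows an interaction, whatever actually caused it.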
Conversion up 60%. Accuracy at 32%. Both apparently true.
These numbers don't contradict each other, which is precisely what makes them interesting. "Value" and "accuracy" have quietly decoupled. The system is measurably driving purchases while measurably steering people away from the best products. So what, exactly, is the $12 billion measuring?
When you A/B test a checkout button, the measurement framework assumes a clean causal chain. Click the blue button, see the result. Click the green one, see the other. Even with known limitations around context and timing, the underlying logic holds because the software is deterministic. Input produces output.
Rufus interprets questions, selects products, frames comparisons, guides attention. The causal relationship between interaction and purchase looks nothing like a clean chain. Did the customer buy because Rufus recommended well? Because the conversation narrowed options in a way that reduced decision fatigue? And then there's the simplest challenge, the one that should probably trouble the $12 billion figure most: the correlation-versus-causation question of whether people who talk to a shopping chatbot were already closer to buying. If Rufus attracts high-intent shoppers and then takes credit for their purchases, the attribution window is capturing proximity as easily as cause.
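The selection-bias mechanism is easy to demonstrate with a toy simulation. Assume shoppers have a latent purchase intent, that intent alone drives both engagement and buying, and give the assistant zero causal effect. Window-style attribution still reports a large conversion lift:

```python
import random

random.seed(0)

# Every shopper has a latent intent; the assistant has ZERO causal
# effect on purchases, but high-intent shoppers engage with it more.
def simulate(n=100_000):
    engaged_n = engaged_buys = other_n = other_buys = 0
    for _ in range(n):
        intent = random.random()            # latent purchase intent in [0, 1]
        engages = random.random() < intent  # selection: intent drives engagement
        buys = random.random() < intent     # purchase: intent alone drives buying
        if engages:
            engaged_n += 1
            engaged_buys += buys
        else:
            other_n += 1
            other_buys += buys
    return engaged_buys / engaged_n, other_buys / other_n

engaged_rate, baseline_rate = simulate()
# Engaged shoppers convert roughly twice as often, purely from selection.
print(f"conversion lift: {engaged_rate / baseline_rate - 1:.0%}")
```

A "60% more likely to purchase" headline is fully consistent with this world, where the assistant contributes nothing at all. That doesn't mean Rufus contributes nothing; it means the lift number can't tell you.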
The conceptual ground here gets genuinely soft. Counterfactual measurement requires defining what would have happened without the system. What does that world look like? As AI explainability researchers have noted, those counterfactual scenarios are inherently "very atypical," with attributions "highly sensitive" to baseline choices. Small changes in how you define the world-without-the-agent produce dramatically different value estimates.
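How sensitive is the headline number to the baseline choice? A toy calculation with invented figures (these are not Amazon's numbers) shows the swing:

```python
# All figures here are made up for illustration; only the logic matters.
touched_sales = 100e9  # annual sales that followed an assistant interaction

# Each baseline is a different assumption about the world-without-the-agent:
# what fraction of those sales would have happened anyway?
baselines = {
    "none would have happened anyway": 0.00,
    "half would have happened anyway": 0.50,
    "88% would have happened anyway": 0.88,
}

for label, frac in baselines.items():
    incremental = touched_sales * (1 - frac)
    print(f"{label}: ${incremental / 1e9:.0f}B incremental")
```

The same touched-sales figure yields $100B, $50B, or $12B of "incremental" value depending on one unobservable parameter. The precision of the output is entirely borrowed from the confidence of the assumption.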
And we're reaching for measurement frameworks that weren't built for this. We built them for deterministic systems where a human decides and the software executes. Attribution was already hard there, with conflicts in 35% of conversions when campaigns run across multiple platforms. What happens when the system itself is making decisions, shaping the interaction, choosing what to show? The precision of a number like $12 billion can make you forget to ask how much interpretive scaffolding holds it up.
Rufus is clearly doing something. Millions of people use it. Some meaningful fraction of them buy things they wouldn't have found otherwise. But the gap between what organizations are deploying and what they can rigorously measure keeps widening. A 60% conversion lift and a 32% accuracy rate sit side by side in the data, and we don't yet have measurement language that can hold both without flattening one into the other.
That coexistence might be the most honest thing about the whole disclosure. The $12 billion carries the precision of a settled figure. Spend any time with the methodology and the ground underneath it is still moving.
Things to follow up on...
- Half of agents ungoverned: A Gravitee study found over 3 million AI agents operating within US and UK corporations, with 53% not actively monitored or secured, raising the question of how organizations measure value from systems they can't even observe.
- Gartner's cancellation forecast: Even as 40% of enterprise apps are projected to embed AI agents by end of 2026, Gartner predicts more than 40% of agentic AI projects will be canceled by 2027 due to escalating costs, unclear business value, or inadequate risk controls.
- Agent management as infrastructure: Gartner called agent management platforms "the most valuable real estate in AI," projecting enterprise spending on the category will grow from less than $5 million today to $15 billion by 2029, a 3,000x increase driven partly by the need to actually measure what agents do.
- Autonomy spectrum emerging: Deloitte outlines a progressive framework of humans in the loop, on the loop, and out of the loop, predicting that the most advanced businesses will begin shifting toward human-on-the-loop orchestration in 2026, each level carrying different measurement and accountability challenges.

