A major support platform recently introduced outcome-based pricing for its AI agents. The billable unit: an automated resolution. A conversation handled entirely by AI, no escalation, ticket not reopened within 72 hours. Clean metric — and one where a frustrated customer who gives up and calls the main line registers the same as a satisfied one.
That's a specification decision. And the buyer didn't make it.
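To make the decision concrete, here is a minimal sketch of that billable predicate. The field names are hypothetical, since the vendor's actual schema isn't public; the structure follows the description above.

```typescript
// A sketch of the billable "automated resolution" predicate described above.
// Field names are hypothetical; only the shape of the rule is taken from the
// vendor's public description.
interface Ticket {
  handledEntirelyByAI: boolean;
  escalatedToHuman: boolean;
  hoursUntilReopened: number | null; // null means never reopened
}

const REOPEN_WINDOW_HOURS = 72;

function isBillableResolution(t: Ticket): boolean {
  // "Resolved" is purely structural: the AI handled it, nobody escalated,
  // nobody reopened the same ticket inside the window. There is no
  // satisfaction signal and no check for the customer reappearing on
  // another channel, so the caller who gave up and dialed the main line
  // counts exactly like the customer who got what they needed.
  const notReopened =
    t.hoursUntilReopened === null || t.hoursUntilReopened > REOPEN_WINDOW_HOURS;
  return t.handledEntirelyByAI && !t.escalatedToHuman && notReopened;
}
```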
Most organizations deploying agents haven't defined what a successful outcome looks like in operational terms. The kind of terms you'd need to evaluate whether an agent's output was actually good enough. Barely half of organizations run formal evaluations on their agents. That looks like a tooling gap until you notice the deeper problem: if you haven't said what "correct" means, you can't tell whether you've achieved it, and you can't build an eval to check.
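For contrast, here is roughly what writing the definition down on the buyer's side might look like. This is a sketch with hypothetical fields and thresholds, not any particular framework's API; the point is that every clause is a quality judgment someone has to own.

```typescript
// A buyer-side operational definition of "resolved", sketched with
// hypothetical fields and thresholds. Each clause encodes a judgment
// that otherwise gets made by someone else's default.
interface ResolutionRecord {
  handledEntirelyByAI: boolean;
  escalatedToHuman: boolean;
  hoursUntilReopened: number | null;            // same ticket reopened
  hoursUntilOtherChannelContact: number | null; // e.g. phoned the main line
  csatScore: number | null;                     // 1-5 survey, if answered
}

function meetsOurDefinitionOfResolved(r: ResolutionRecord): boolean {
  const WINDOW_HOURS = 72;
  const notReopened =
    r.hoursUntilReopened === null || r.hoursUntilReopened > WINDOW_HOURS;
  const noChannelHop =
    r.hoursUntilOtherChannelContact === null ||
    r.hoursUntilOtherChannelContact > WINDOW_HOURS;
  const acceptableCsat = r.csatScore === null || r.csatScore >= 4;
  return (
    r.handledEntirelyByAI &&
    !r.escalatedToHuman &&
    notReopened &&
    noChannelHop &&
    acceptableCsat
  );
}
```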
Vacuums get filled. Usually by whatever's closest.
At the protocol layer, MCP now has 97 million monthly SDK downloads and sits under the Linux Foundation. Its tool annotation system encodes defaults about what constitutes a risky interaction: destructive, non-idempotent, open-world. These are optional hints, and no client enforces them consistently. The Enterprise Working Group that would define audit trails, auth flows, and access governance? The roadmap says it's expected "to form." Future tense. The protocol most teams encounter today carries no opinion about who should access what, under what conditions, with what trail. That absence is itself a kind of specification.
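For concreteness, the annotations in question look roughly like this on a tool declaration. The hint field names (readOnlyHint, destructiveHint, idempotentHint, openWorldHint) come from the MCP specification's optional tool annotations; the tool itself is hypothetical.

```typescript
// The shape of a tool declaration as it might appear in a tools/list
// response. The annotation field names are from the MCP specification;
// the tool and its values are illustrative.
const deleteInvoiceTool = {
  name: "delete_invoice",
  description: "Permanently delete an invoice by ID.",
  inputSchema: {
    type: "object",
    properties: { invoiceId: { type: "string" } },
    required: ["invoiceId"],
  },
  annotations: {
    readOnlyHint: false,    // the tool modifies state
    destructiveHint: true,  // updates may be irreversible
    idempotentHint: false,  // repeating the call has additional effect
    openWorldHint: false,   // operates on a closed system, not the open web
  },
};

// Nothing in the protocol obliges a client to read these hints, let alone
// refuse or confirm a call because of them. A client that ignores the
// annotations entirely is still conformant; whatever policy exists lives
// wherever each client decides to put it.
```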
Why do organizations leave this work undone? Some of it is genuinely hard in a way that compounds on itself. Defining "correct" for an agent doing complex work requires the same domain expertise the agent is supposed to augment. The person who could write the spec is often the person whose judgment the agent is meant to scale. That circularity doesn't resolve with better tooling. Some of it is organizational: writing the spec forces implicit disagreements about quality and accountability into the open. Easier to pilot indefinitely than to commit to a definition someone owns.
Infrastructure fills the gap regardless. Pricing models define what counts as a completed task. Protocol defaults define what counts as a safe tool interaction. Eval frameworks define what counts as quality. One way to read this: it's a reasonable division of labor. Organizations that will never write their own specification may genuinely be better served by thoughtful defaults than by no specification at all. AWS made S3 Block Public Access optional in 2018, then a console default, then universal in 2023. Millions of security postures shaped by a migration path. For most of those organizations, the default was better than what they would have chosen on their own.
But storage defaults governed data access. Agent infrastructure defaults are starting to govern what counts as a completed task, a trustworthy tool, a quality interaction. The surface area is wider, and the judgments are less obviously technical. Organizations with systematic evaluation frameworks see roughly six times higher production success rates. The common thread seems to be that someone sat down and defined what "good" meant before infrastructure defined it for them.
Defaults that nobody recognizes as specification decisions are the hardest ones to revisit. An organization optimizing against a vendor's definition of "resolved" might never realize the choice was already made.
Things to follow up on...
- MCP's annotation trust problem: The protocol's co-creator asked during review what value the hints provide if clients can't trust them, and that question still shapes every annotation proposal today.
- The 15% readiness finding: Fivetran's 2026 Agentic AI Readiness Index found that only 15% of organizations are fully prepared for production agentic AI, even as nearly 60% report investing millions.
- Reliability lagging capability gains: A February 2026 paper found that despite steady accuracy improvements across 12 frontier models, reliability barely budged over 18 months, suggesting that specification and evaluation gaps compound at the operational layer.
- Seat-based pricing in retreat: A recent industry study found seat-based SaaS pricing fell from 21% to 15% of companies in twelve months while hybrid models surged to 41%, suggesting outcome definitions embedded in pricing are becoming the norm rather than the exception.

