Companies using governance tools deploy twelve times more AI projects to production, according to Databricks telemetry across 20,000 customers. That number is measuring something real. It's worth being precise about what.
Governance, in that figure, means access controls, rate limiting, cost tracking, audit logging, and runtime policy enforcement. These are mechanisms that make AI deployments organizationally legible. They reduce the friction between a working prototype and a sanctioned production system. They have nothing to say about whether the agent's outputs are correct, reliable, or safe. A Databricks executive noted that governance tends to be "trailing behavior," following initial adoption: organizations already succeeding at deployment subsequently invest in governance tooling. The 12x correlation captures organizational maturity. It may not capture causation at all.
Now hold that alongside a finding from the AI Agent Index: only 4 of 13 frontier-autonomy agents disclose safety evaluations specific to their agentic deployment. Twenty-five of 30 agents surveyed publish no internal safety results. Twenty-three have no third-party testing documentation.
Authorization and behavioral evaluation both get filed under "governance," but they belong to different problems. The governance infrastructure accumulating across the industry addresses the first with increasing sophistication. The second has barely any infrastructure at all.
The practitioner data traces the split precisely. In LangChain's survey of over 1,300 builders, 89% have implemented observability for their agents. Only 52% run offline evaluations. Observability records what happened; evaluation asks whether what happened was any good. The gap between 89% and 52% is the gap between organizational legibility and quality assurance, rendered in a single survey.
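The distinction is easy to see in miniature. The sketch below is a toy illustration and does not reflect any vendor's tooling: `TracedAgent`, `offline_eval`, and `toy_agent` are hypothetical names, and exact-match scoring stands in for whatever grading a real evaluation would use.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Trace:
    """Observability: a record of what the agent did."""
    prompt: str
    output: str


@dataclass
class TracedAgent:
    """Wraps an agent so every call is logged (hypothetical helper)."""
    agent: Callable[[str], str]
    traces: List[Trace] = field(default_factory=list)

    def run(self, prompt: str) -> str:
        output = self.agent(prompt)
        self.traces.append(Trace(prompt, output))
        return output


def offline_eval(agent: Callable[[str], str],
                 cases: List[Tuple[str, str]]) -> float:
    """Evaluation: fraction of cases whose output matches the expected answer."""
    passed = sum(1 for prompt, expected in cases if agent(prompt) == expected)
    return passed / len(cases)


def toy_agent(prompt: str) -> str:
    # Hypothetical stand-in for a real agent: it always answers "4".
    return "4"


traced = TracedAgent(toy_agent)
cases = [("2+2", "4"), ("3+3", "6")]
score = offline_eval(traced.run, cases)

print(len(traced.traces))  # every call was traced
print(score)               # only half the answers were right
```

The trace is complete and the agent is still wrong half the time. Nothing in the first fact tells you the second.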
Governance infrastructure gets built because it maps to roles that already exist on the org chart. Compliance officers know what they need. Procurement knows what to buy. IT security has budget and authority. Behavioral quality assurance for agents maps to no existing function. The evaluation methods are, in the Five Eyes' own phrasing, "still evolving" and "only partially capture real-world deployment conditions." Agent-specific safety evaluation depends on deployment context, resists standardization, and has no natural organizational owner. The function isn't being refused. The buyer for it hasn't materialized.
The Five Eyes guidance surfaces a detail that complicates the governance story further: most agent logs are self-reported and mutable. An auditor requesting proof of what an agent did receives a document the system wrote about itself. Governance infrastructure can mandate that these artifacts exist. It cannot independently verify they're complete.
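Mutability and completeness are separate problems, and a small sketch shows why. The code below is an illustration under stated assumptions, not something the Five Eyes guidance prescribes: it chains each log entry to the hash of the previous one, so a later edit becomes detectable. The function names and record fields are hypothetical.

```python
import hashlib
import json


def entry_hash(prev_hash: str, record: dict) -> str:
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()


def append(log: list, record: dict) -> None:
    """Append a record, linking it to the previous entry's hash."""
    prev = log[-1]["hash"] if log else ""
    log.append({"record": record, "hash": entry_hash(prev, record)})


def verify(log: list) -> bool:
    """Recompute the chain; any edited record breaks it."""
    prev = ""
    for entry in log:
        if entry["hash"] != entry_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True


log = []
append(log, {"action": "read_file", "path": "report.txt"})
append(log, {"action": "send_email", "to": "ops"})
intact = verify(log)

# Simulate an after-the-fact edit to the first entry.
log[0]["record"]["path"] = "other.txt"
tampered_detected = not verify(log)
```

Even this only narrows the problem. A writer who controls the whole log can recompute the chain unless the hashes are anchored somewhere it cannot reach, and no chain can reveal an action the agent never logged in the first place.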
Governance infrastructure that satisfies audit requirements may actively reduce the organizational pressure to address behavioral evaluation. The compliance box is checked. The urgency dissipates.
And here the two data points start suggesting a trajectory. The audit trail looks complete. Raising the question of whether the agent's outputs are actually good becomes harder once the deployment looks governable. The Stanford FMTI found the same pattern at the industry level: companies disclose capability evaluations while post-deployment impact remains the weakest transparency category. What can be measured gets reported. What would require independent verification grows quieter over time.
Twelve times more projects reaching production is a genuine achievement. The infrastructure making deployment faster and the infrastructure that would make deployment trustworthy may be structurally different things. And by satisfying the organizational need for legibility, the first may be making the second feel less urgent than it is.
Things to follow up on...
- Five Eyes assume unexpected behavior: The joint agentic AI security guidance from five governments recommends restricting agents to low-risk, non-sensitive tasks until evaluation maturity is demonstrated, a position most deploying organizations have not explicitly documented.
- Evals lag observability everywhere: LangChain's State of Agent Engineering survey found that while 89% of teams have tracing in place, only 37% run online evaluations, suggesting the industry has instrumented the easy half of production readiness.
- Transparency scores are falling: The Stanford FMTI's average score dropped from 58 to 40 between 2024 and 2025, with training data, compute resources, and post-deployment impact remaining the weakest disclosure categories across major model developers.
- Databricks governance is platform-specific: The 12x figure is drawn from Databricks' own Unity AI Gateway telemetry, meaning "governance" in this context maps to a specific product's access controls and audit logging rather than governance programs in general.

