CURRENT | Market Pulse

The Trust Stack

Who Owns the Trust Layer?

By Nora Kaplan | March 11, 2026

Until last weekend, 130,000 developers used the same open-source tool to stress-test AI models from every major provider. It supported over 60 of them. It belonged to none of them. That changed when one of the companies it evaluated announced plans to acquire it.

The acquisition is easy to read as a product move. But it arrived the same month as native computer-use capabilities and automated security scanning, and the three pieces together trace a pattern more interesting than any individual announcement. Someone is assembling a vertically integrated trust stack. What happens to an ecosystem's definition of "safe enough" when one company owns both the agents and the tools that judge them?

The Moves

A busy week, and one with a through-line worth noting: the capabilities keep accelerating, and the security surface keeps expanding to match. GPT-5.4 ships with computer use built in. A zero-click exploit in Perplexity's agentic browser shows what happens when those capabilities meet the real world. OpenAI buys an eval company. NVIDIA positions an enterprise agent platform around trust.

The protocol layer is maturing too, with MCP's 2026 roadmap reflecting lessons that only come from things breaking in production. Six developments, one recurring tension.

Model Launch

GPT-5.4 Ships With Native Computer Use Baked In

OpenAI's GPT-5.4 arrives with native computer-use capabilities in Codex and the API, a 1M-token context window, and a claimed 33% reduction in factual errors over GPT-5.2. It takes the top spot on the APEX-Agents professional benchmark. Notably, OpenAI published safety research on reasoning concealment alongside the launch.

Agent Security

Codex Security Previews AI-Powered Vulnerability Scanning at Scale

OpenAI's research preview of Codex Security puts an AI agent to work finding and patching vulnerabilities across large codebases. Anthropic's Claude Code Security launched in February targeting the same territory. Codex Security is still in preview and not generally available.

Acquisition News

OpenAI Acquires Promptfoo, Pledges Open-Source Continuity

Promptfoo, the open-source AI evaluation tool with 350K+ developers and 125K+ active users, is being acquired by OpenAI at an $86M post-Series A valuation. About a quarter of Fortune 500 companies use it. Integration into OpenAI's Frontier platform is planned. OpenAI says the open-source project will continue.

Platform Preview

NVIDIA Readies NemoClaw for GTC Keynote Reveal

NVIDIA is expected to unveil NemoClaw at Jensen Huang's GTC keynote March 16: an open-source, hardware-agnostic enterprise agent platform under Apache 2.0, built atop OpenClaw (200K+ GitHub stars). No public code yet. The enterprise pitch leans on trust gaps that plagued OpenClaw, including Meta's device ban after an unsecured database incident.

Vulnerability Disclosure

Perplexity Comet Zero-Click Exploit Exposes Agentic Browser Risks

Zenity Labs disclosed PleaseFix, a family of zero-click vulnerabilities in Perplexity's Comet browser that enabled silent credential exfiltration from 1Password and file theft. Perplexity patched before disclosure. Zenity's CTO called it "an inherent vulnerability in agentic systems." Anthropic claims Claude Opus 4.5 cuts prompt injection success to 1%, but acknowledges that this still represents meaningful risk.

Protocol Roadmap

MCP 2026 Roadmap Puts Transport Scalability Up Front

The MCP roadmap update on March 5 names transport scalability as its top priority. New MCP Server Cards let clients discover capabilities without establishing a connection. The Tasks primitive, meanwhile, needs lifecycle refinements that production deployments have surfaced. The community has now built over 1,000 servers.

Evaluation Frontiers

GPT-5.4 Safety Research: CoT Controllability Evaluation

OpenAI open-sourced an evaluation measuring whether models can deliberately hide their chain-of-thought reasoning from monitors. GPT-5.4 showed low concealment ability (~0.3%), with Apollo Research independently confirming low covert deception rates (~1%).

Why watch this closely?

If agents learn to reshape reasoning under observation, every monitoring stack in production becomes theater. This evaluation tries to bound that risk.

What's the uncomfortable detail?

Evaluation awareness jumped to 21.3%, suggesting growing situational sophistication even as deliberate concealment stays low. The gap between noticing and acting may narrow.

LiveAgentBench: 104 Real-World Scenarios, Sobering Results

Ant Group's benchmark throws agents at 374 tasks across 104 scenarios, sourced from real social media questions and spanning web navigation, file operations, coding, and information retrieval. Binary pass/fail scoring and contamination-resistant updates keep results honest.

What separates this from lab benchmarks?

Tasks pulled from real user queries on real websites. Agents face the web as people actually encounter it, not sanitized mock environments.

How big is the gap?

Significant, and uneven. Top agents still trail human capabilities, and performance swings wildly across task categories in ways that resist a simple narrative.

TraceSIR: Making Sense of Agent Execution Traces at Scale

TraceSIR is a three-agent framework that compresses sprawling execution traces into structured representations, then performs fine-grained diagnosis, including issue localization and root cause analysis. It addresses a real bottleneck: production traces are too long for manual review and too messy for naive LLM analysis.

Where does this land in practice?

It moves observability past outcome-level pass/fail toward behavioral diagnosis you can actually act on across multiple task instances.

Why is raw trace analysis failing?

Agent traces routinely exceed LLM context limits and degrade analysis quality. Structured abstraction turns out to be a prerequisite, not a convenience.

TrajAD: A Specialized Verifier for Agent Trajectory Errors

General-purpose LLMs fail to pinpoint errors in agent trajectories regardless of scale. A small, specialized verifier trained on synthesized anomaly data outperforms larger models at detecting and locating the exact step where execution goes wrong.

Why can't bigger models just handle this?

Scaling doesn't fix error localization. Targeted fine-tuning on anomaly trajectories succeeds where zero-shot reasoning from frontier models falls flat.

What does this unlock?

Step-level error detection enables precise rollback-and-retry at runtime, shifting agent reliability from post-mortem analysis toward something closer to live correction.
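
For readers who want to picture what rollback-and-retry means in an agent loop, here is a minimal Python sketch. It is not TrajAD's implementation: the `Step` record, the `run_with_rollback` function, and the toy `execute` and `verify` callables are all hypothetical names invented for illustration. The only assumption is that some verifier can return the index of the first bad step, or `None` if the trajectory looks clean.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    action: str
    result: str


# Hypothetical verifier interface: index of the first faulty step, or None.
Verifier = Callable[[List[Step]], Optional[int]]


def run_with_rollback(
    plan: List[str],
    execute: Callable[[str, int], str],
    verify: Verifier,
    max_retries: int = 2,
) -> List[Step]:
    """Run each planned action, checking the trajectory after every step.

    When the verifier flags a step, discard that step and everything after
    it, then resume from the flagged position (an assumed retry policy).
    """
    trajectory: List[Step] = []
    retries = 0   # total retry budget used so far
    attempt = 0   # attempt number passed to execute for the current step
    i = 0
    while i < len(plan):
        trajectory.append(Step(plan[i], execute(plan[i], attempt)))
        bad = verify(trajectory)
        if bad is None:
            i += 1
            attempt = 0
            continue
        if retries >= max_retries:
            raise RuntimeError(f"step {bad} still failing after {max_retries} retries")
        del trajectory[bad:]  # roll back to just before the flagged step
        i = bad
        retries += 1
        attempt += 1
    return trajectory


def toy_execute(action: str, attempt: int) -> str:
    # Toy environment: "fetch" fails on its first attempt only.
    return "error" if action == "fetch" and attempt == 0 else "ok"


def toy_verify(trajectory: List[Step]) -> Optional[int]:
    # Toy verifier: flag the first step whose result is "error".
    for idx, step in enumerate(trajectory):
        if step.result == "error":
            return idx
    return None


trajectory = run_with_rollback(["open", "fetch", "save"], toy_execute, toy_verify)
```

The point of the sketch is the dependency it exposes: the loop can only roll back as precisely as the verifier can localize, which is why step-level detection matters more than a final pass/fail verdict.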

Open Source, New Owner

OpenAI Buys Promptfoo: Can a Player Own the Referee?

Over 350,000 developers used Promptfoo to red-team AI applications across 60+ providers because it belonged to nobody. That neutrality was the product. OpenAI has promised to keep the project open-source and multi-provider. Promises are cheap; structural incentives are patient.

The default red-team grader already runs on GPT-5. The roadmap now shares a team with OpenAI's enterprise platform. No governance firewall between the open-source project and the proprietary product has been announced. The code stays open. The question is whether the community still trusts it.

You can fork a codebase. Forking a contributor ecosystem of 248 people and Fortune 500 adoption is a different proposition entirely.

Regulatory timing: EU AI Act enforcement arrives August 2026, funneling urgent enterprise demand for structured testing through a tool one provider now owns.

Fork dynamics: MIT licensing means the community can walk, but replicating momentum, integrations, and contributor density takes years, not weeks.

Competitive signal: Anthropic shipped Claude Code Security weeks prior, confirming that safety tooling is becoming a battleground rather than shared infrastructure.

Enterprise exposure: A quarter of Fortune 500 companies use Promptfoo today and should reassess vendor-neutrality assumptions before Frontier integration deepens.

Deal math: $23M raised, $86M valuation, 23-person team. OpenAI acquired a community and its trust, not just a product.