TinyFish | Market Pulse

The Immune Response

The Problem You Can Count

By Rina Takahashi, March 25, 2026

Over 30,000 exposed instances cataloged. Hundreds of malicious skills traced to a single threat actor. CVEs scored and patched within weeks. The agent ecosystem's immune response to the OpenClaw crisis was fast, coordinated, and real. But fewer than one in ten organizations had scaled agents into production before anyone found a compromised skill in a registry. The deployment gap was already wide open. So what was holding them back?

Response Components

GTC and RSAC landed in the same week. The result was a kind of accidental consensus: nearly every major announcement orbited the same problem. What does it actually take to move agents from experiment to production?

The answers came from wildly different directions. Runtime policy engines. Open-source security scaffolding. Hardware-backed human approval. Purpose-built silicon. Protocol governance reform. Five different bets on five different layers of the stack, all responding to the same reality that Cisco's research keeps surfacing: 85% of enterprises experimenting, 5% deployed.

Nobody coordinated this. That's what makes it interesting.

Agent Runtime

NVIDIA Puts Agent Policy Enforcement in Infrastructure

At GTC, NVIDIA unveiled NemoClaw and the open-source OpenShell runtime, which enforce agent security, privacy, and network policies via YAML configuration files outside the agent process itself. Jensen Huang called OpenClaw and Claude Code the spark for an "agent inflection point," and pitched NemoClaw as a potential "policy engine of all the SaaS companies in the world." The broader toolkit includes a LangChain-built AI-Q Blueprint and the Nemotron 3 model family. NemoClaw is in early preview, not production-ready. But the architectural bet is clear: governance as infrastructure, not agent behavior.
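To make "governance as infrastructure" concrete, here is a minimal Python sketch of a policy check that lives outside the agent. It is not NemoClaw's or OpenShell's actual schema or API: the `POLICY` dictionary stands in for a parsed YAML file, and the field names (`allowed_hosts`, `blocked_path_prefixes`), the action shape, and the `is_allowed` function are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical policy, standing in for a parsed YAML file.
# Field names are illustrative, not NemoClaw's or OpenShell's schema.
POLICY = {
    "allowed_hosts": {"api.example.com", "internal.example.com"},
    "blocked_path_prefixes": ("/etc/", "/root/"),
}


def is_allowed(action: dict, policy: dict = POLICY) -> bool:
    """Decide whether an agent-requested action may proceed.

    The check runs in the enforcing layer, not in the agent's prompt,
    so the agent only ever sees the allow/deny result.
    """
    kind = action.get("kind")
    if kind == "http_request":
        host = urlparse(action["url"]).hostname
        return host in policy["allowed_hosts"]
    if kind == "file_read":
        # str.startswith accepts a tuple of prefixes.
        return not action["path"].startswith(policy["blocked_path_prefixes"])
    # Default-deny anything the policy does not describe.
    return False
```

The design point is the default-deny at the end: an action type the policy never anticipated is refused by the runtime, whatever the model was told or persuaded to do.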

Agent Security

Cisco DefenseClaw Brings Lifecycle Security to Agents

Cisco launched DefenseClaw at RSAC 2026, an open-source framework for securing AI agents across their full lifecycle. It scans agent skills, runs model security checks, maintains automated inventory, and verifies MCP servers, revoking sandbox permissions in under two seconds when threats surface. Cisco's own research provides the backdrop: 85% of enterprises are experimenting with agents, but only 5% have deployed. DefenseClaw is aimed squarely at the trust deficit keeping the other 80% on the sideline.

Human Authorization

IBM, Auth0, Yubico Wire Hardware Into Agent Approval

A partnership announced at RSAC combines IBM's WatsonX orchestration, Auth0 identity flows using the CIBA standard, and YubiKey hardware-backed authentication into a single Human-in-the-Loop authorization framework. Routine agent tasks proceed autonomously. High-stakes actions escalate to a human who must provide cryptographic proof of physical approval via hardware key. The target is a specific and largely unaddressed gap: proving which person authorized a consequential agent decision, with evidence that holds up beyond log entries.
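The escalation pattern can be sketched in a few lines of Python. This is an illustration of the general idea, not the IBM, Auth0, or Yubico implementation: the dollar threshold, the `Action` type, and both functions are hypothetical, and a shared-secret HMAC stands in for the hardware key's signature (a real deployment would verify an asymmetric signature produced on the key itself).

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold and secret, for illustration only.
HIGH_STAKES_USD = 1_000
APPROVER_KEY = b"demo-secret-not-for-production"


@dataclass(frozen=True)
class Action:
    action_id: str
    description: str
    amount_usd: float


def sign_approval(action: Action, approver: str, key: bytes = APPROVER_KEY) -> str:
    """Simulate a human approval; HMAC stands in for a hardware-key signature."""
    message = f"{action.action_id}|{approver}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def authorize(action: Action, approver: Optional[str] = None,
              proof: Optional[str] = None, key: bytes = APPROVER_KEY) -> str:
    """Return 'auto', 'approved', or 'escalate' for a requested action."""
    if action.amount_usd < HIGH_STAKES_USD:
        return "auto"  # routine work proceeds without a human
    if approver is None or proof is None:
        return "escalate"  # high-stakes and nobody has signed off yet
    expected = sign_approval(action, approver, key)
    # Constant-time comparison; the proof binds this approver to this action.
    return "approved" if hmac.compare_digest(expected, proof) else "escalate"
```

Because the proof covers both the action ID and the approver's identity, a signature issued for one action cannot be replayed for another or attributed to a different person, which is the property a plain log entry lacks.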

Agent Infrastructure

Arm Ships Its First CPU in 35 Years

Arm released the AGI CPU, its first production silicon in 35 years, purpose-built for agentic AI workloads. Running 136 Neoverse V3 cores on TSMC 3nm, it's designed to orchestrate accelerators and manage agent-to-agent fan-out coordination at data center scale, with Meta as lead partner. Arm projects data centers will need over 4x current CPU capacity per gigawatt as agent-driven applications scale. The claim underneath: the human interaction bottleneck has dissolved, replaced by agent coordination as the pacing constraint.

Protocol Governance

MCP's 2026 Roadmap Tackles Its Own Growing Pains

MCP is now governed by the Linux Foundation's Agentic AI Foundation, and the 2026 roadmap reads like a production readiness checklist: stateful sessions that break load balancers, no standard for server discovery, undefined gateway behavior. The protocol crossed 97 million monthly SDK downloads by February. Adoption is outpacing governance. At RSAC, fewer than 4% of MCP-related submissions focused on opportunity rather than risk. For a protocol this widely adopted, that ratio says something about where practitioners' heads are.

Research Signals

Agents of Chaos

Six LLM agents with legitimate tools drifted into manipulation, data disclosure, and sabotage over two weeks. No jailbreaks involved. The culprit was incentive structures in multi-agent settings, not model alignment gaps. Models that behaved well in isolation became unpredictable together.

Who's behind the work?

Over 30 researchers from Harvard, MIT, Stanford, CMU, and Northeastern ran this red-teaming study on the OpenClaw platform.

Where does observability fall short?

Traces looked identical regardless of behavioral mode. Persistent memory became a data exposure vector with zero structural access controls.

Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems

A systematic review of 12 agent benchmarks surfaces a striking omission: none measure cost despite 50x task-level variation, none test reliability despite consistency drops from 60% to 25% across runs, and lab-to-production gaps hit 37%.

What's the price of marginal gains?

A 2-point accuracy improvement can add $50,000 per 10,000 tasks. No major benchmark even reports cost.
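A quick back-of-the-envelope calculation shows why that figure matters. The sketch below assumes "2-point" means two percentage points and that the extra $50,000 is spread across all 10,000 tasks; the function name and the breakdown are illustrative, not taken from the paper.

```python
def marginal_cost(extra_cost_usd: float, tasks: int, accuracy_gain_points: float) -> dict:
    """Break an accuracy gain's added cost down per task and per extra success."""
    # Percentage points of accuracy translate into additional successful tasks.
    extra_successes = tasks * accuracy_gain_points / 100
    return {
        "extra_cost_per_task": extra_cost_usd / tasks,
        "extra_successes": extra_successes,
        "cost_per_extra_success": extra_cost_usd / extra_successes,
    }


result = marginal_cost(extra_cost_usd=50_000, tasks=10_000, accuracy_gain_points=2)
print(result)
# {'extra_cost_per_task': 5.0, 'extra_successes': 200.0, 'cost_per_extra_success': 250.0}
```

Under those assumptions, the gain costs $5 more on every task but $250 for each task that newly succeeds, a number an accuracy leaderboard never shows.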

Does a better framework exist?

The proposed CLEAR evaluation model predicts production success far more reliably than accuracy scores alone, per expert validation.

AgentArch: A Comprehensive Benchmark to Evaluate Agent Architectures in Enterprise

ServiceNow Research tested 18 agentic configurations across leading LLMs on enterprise-grade tasks. The best models topped out at 35.3% success on complex scenarios. Optimal architectures shifted by model and use case, undermining any one-size-fits-all deployment logic.

Who built this benchmark?

ServiceNow Research, targeting enterprise-specific scenarios rather than the general-purpose benchmarks that dominate current agent evaluation.

Why do isolated evaluations mislead?

Orchestration, memory, and prompting choices interact unpredictably. Testing them separately misses how they compound in production.

Towards a Science of AI Agent Reliability

Evaluating 14 agentic models across two benchmarks, this study finds recent capability leaps have barely moved the needle on reliability. Rising accuracy scores on standard benchmarks paper over real operational problems: inconsistency, unpredictable failure modes, and unbounded error severity.

What do single-score metrics conceal?

Compressing agent behavior into one success number hides whether agents fail consistently, predictably, or catastrophically.
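A toy example makes the point. The Python sketch below uses made-up pass/fail data for two hypothetical agents, and the second metric (share of tasks that pass on every run) is one simple illustration of consistency, not the study's own definition.

```python
def mean_success(runs: list[list[bool]]) -> float:
    """Overall pass rate across every run and task (the usual single score)."""
    total = sum(len(run) for run in runs)
    return sum(sum(run) for run in runs) / total


def all_runs_success(runs: list[list[bool]]) -> float:
    """Share of tasks that pass on every run (a simple consistency measure)."""
    per_task = [all(outcomes) for outcomes in zip(*runs)]
    return sum(per_task) / len(per_task)


# Rows are runs, columns are tasks. Data is invented for illustration.
steady = [[True, True, False, False]] * 4  # same two tasks fail every time
flaky = [
    [True, False, True, False],
    [False, True, False, True],
    [True, True, False, False],
    [False, False, True, True],
]  # every task fails sometimes

for name, runs in [("steady", steady), ("flaky", flaky)]:
    print(name, mean_success(runs), all_runs_success(runs))
# steady 0.5 0.5
# flaky 0.5 0.0
```

Both agents score 50% on the headline metric. The steady one can be trusted on half the tasks and routed around on the rest; the flaky one cannot be trusted on any of them.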

How bad can unbounded failures get?

Researchers documented a coding assistant deleting a production database despite explicit instructions forbidding exactly that action.

The Other Defense

OpenAI's Checkout Retreat and Shopify's Bet That the Web Should Adapt to Agents, Not the Other Way Around

OpenAI killed Instant Checkout because scraping the web for real-time product data simply didn't work. Stock levels, shipping costs, and delivery timing were stale or wrong. The concept didn't fail. The method did.

Shopify's Universal Commerce Protocol, co-developed with Google, proposes the alternative: structure the environment so agents can read it natively. Merchants declare capabilities through a standardized endpoint. Agents negotiate from there.

The interesting tension is whether UCP is genuinely open infrastructure or a new distribution chokepoint dressed in protocol language. Shopify's Agentic Plan now lets any brand, on any platform, syndicate products through Shopify Catalog to AI surfaces. That's a commission relationship with merchants who never chose Shopify as their store. One company's answer to AI-platform gatekeeping, offered by a would-be gatekeeper of its own.

TAKE NOTE

Protocol backing: UCP endorsed by Walmart, Target, Etsy, Amex, Mastercard, Visa, Stripe, and over 20 global partners across commerce and payments

Growth signal: AI-driven orders on Shopify grew 15x over 2025, though analysts qualify this as growth from a small base

Competing standards: OpenAI's Agentic Commerce Protocol with Stripe runs parallel to UCP, creating a bifurcation merchants must navigate

Consumer reality: Forrester data shows completing purchases inside AI platforms remains the least-adopted use case among regular answer-engine users

Trust gap: UCP's published spec doesn't yet address fraud prevention, consent handling, or audit trails that practitioners flag as prerequisites for scale