The Wrong Layer

Prompt injection lives in the page. Defenses concentrate where results are shippable, leaving the actual attack surface largely unaddressed.

By Rina Takahashi— April 29, 2026

Prompt injection lives in the page. Defenses concentrate where results are shippable, leaving the actual attack surface largely unaddressed.

The prompt injection attacks documented this month by Google and Forcepoint share a quality worth noticing: they're embedded in the page itself. Text shrunk to a single pixel. Instructions tucked inside HTML comments. CSS that renders payloads invisible to humans but fully readable to any agent parsing the source. The manipulation happens before the model ever sees the input.

This is known. OpenAI has moved from calling prompt injection "unlikely to ever be fully solved" in December 2025 to assuming agents will be misled and focusing on damage containment by March 2026. Anthropic calls the web "an adversarial environment" outright. A meta-analysis of 78 studies found adaptive attacks succeed against state-of-the-art defenses more than 85% of the time.

Both companies acknowledge the model layer cannot fully contain this. Nearly all defense investment concentrates there anyway.

The reason is structural, and it follows from what produces a shippable result.

AgentDojo, the most widely used evaluation framework for prompt injection, measures attack success rate, benign utility, and utility under attack. All model-behavior metrics. These produce clean numbers that compare across papers, feed into leaderboards, and generate press-friendly figures. Anthropic's "1% attack success rate" is exactly this kind of artifact: a number that travels well. What happens when a real adversary studies the defense and routes around it is something the benchmark infrastructure was never built to capture.

Environment-layer defenses do exist in research. CaMeL, from Google DeepMind, achieves provable security on 77% of benchmark tasks by enforcing data-instruction separation architecturally, without modifying the model at all. StruQ separates prompt and data channels. Real contributions. But a model-layer improvement produces a number, fits into an existing evaluation framework, and ships as a model update. An environment-layer improvement produces an architecture. Architectures are harder to benchmark, harder to compare, harder to describe in a changelog. So they stay in papers.

The gap that widens

The attack surface is the HTML. The defense surface is the model's response to it. That gap widens precisely on the side where defenses are hardest to ship.

Google's scan of billions of crawled pages found a 32% increase in malicious prompt injection payloads between November 2025 and February 2026. Forcepoint documented payloads targeting PayPal transactions, API key exfiltration, and destructive commands, all hidden in page elements that no model-layer defense can prevent from entering the context window.

This points somewhere beyond security. Measurable results attract investment; problems that resist productization sit unfunded. For anyone deploying agents into untrusted web environments, the implication is concrete: the defenses their systems carry were benchmarked against a version of the problem that bears little resemblance to the one they'll encounter in production. The benchmarks are real. The progress they measure is real. Whether any of it is progress toward the right problem is something the benchmarks themselves will never tell you.

Things to follow up on...

Governments know more than benchmarks show: The UK and US AI Safety Institutes developed stronger attacks against Claude 3.5 Sonnet using AgentDojo, but those attacks aren't publicly available, meaning the public leaderboard understates known vulnerability.
Adaptive attacks break everything: Microsoft's LLMail-Inject challenge, involving 839 participants and 208,095 unique attack prompts, found that models performing well on static benchmarks fared dramatically worse against real human adversaries in a realistic email-agent environment.
Google expects cost-benefit to shift: Google's threat intelligence team notes that past prompt injection attempts were low-sophistication partly because compromised agents couldn't reliably act on instructions, but today's more capable agents make the economics of attack far more attractive.
Shared tooling across domains: Forcepoint found shared injection templates appearing across multiple unrelated websites, suggesting organized tooling infrastructure rather than isolated experimentation by individual attackers.

Both companies acknowledge the model layer cannot fully contain this. Nearly all defense investment concentrates there anyway.

The reason is structural, and it follows from what produces a shippable result.

The gap that widens

The attack surface is the HTML. The defense surface is the model's response to it. That gap widens precisely on the side where defenses are hardest to ship.

Things to follow up on...

Governments know more than benchmarks show: The UK and US AI Safety Institutes developed stronger attacks against Claude 3.5 Sonnet using AgentDojo, but those attacks aren't publicly available, meaning the public leaderboard understates known vulnerability.
Adaptive attacks break everything: Microsoft's LLMail-Inject challenge, involving 839 participants and 208,095 unique attack prompts, found that models performing well on static benchmarks fared dramatically worse against real human adversaries in a realistic email-agent environment.
Google expects cost-benefit to shift: Google's threat intelligence team notes that past prompt injection attempts were low-sophistication partly because compromised agents couldn't reliably act on instructions, but today's more capable agents make the economics of attack far more attractive.
Shared tooling across domains: Forcepoint found shared injection templates appearing across multiple unrelated websites, suggesting organized tooling infrastructure rather than isolated experimentation by individual attackers.