Before an agent reads a single price or clicks a single button, it has already been evaluated. The evaluation is entirely about identity.
During the TLS handshake, a detection system extracts fields from the connection's ClientHello packet and hashes them into a fingerprint. The technique, originally published as JA3 in 2017, has since evolved into JA4+ to handle browser randomization. Modern systems layer behavioral signals on top: mouse acceleration curves, keystroke timing variance. A human approaching a button decelerates. A bot clicks at pixel-perfect coordinates with no approach curve. These patterns build a composite identity judgment, regardless of the agent's reasoning capability. Rotating IP addresses doesn't help: a proxy changes the source address, but the fingerprint and the behavioral signature travel with the connection.
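The fingerprinting step is mechanical enough to sketch. Below is a minimal illustration of the JA3 construction, with hypothetical field values: the ClientHello's version, cipher suites, extensions, curves, and point formats are serialized into a comma- and dash-delimited string, then MD5-hashed. Production implementations also strip GREASE values, and JA4 sorts the randomizable fields before hashing so that per-connection shuffling no longer changes the hash.

```python
import hashlib

def ja3_fingerprint(tls_version: int, ciphers: list[int], extensions: list[int],
                    curves: list[int], point_formats: list[int]) -> str:
    """MD5 over the serialized ClientHello fields, per the JA3 construction."""
    def dash(values: list[int]) -> str:
        return "-".join(str(v) for v in values)
    # JA3 string layout: TLSVersion,Ciphers,Extensions,EllipticCurves,ECPointFormats
    ja3 = f"{tls_version},{dash(ciphers)},{dash(extensions)},{dash(curves)},{dash(point_formats)}"
    return hashlib.md5(ja3.encode()).hexdigest()

# Hypothetical field values. A stock HTTP library emits the same ones on every
# connection, from every IP address, so its hash is a stable identity.
print(ja3_fingerprint(771, [4865, 4866, 4867], [0, 11, 10, 35, 23], [29, 23, 24], [0]))
```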
The judgment happens before the first byte of content is served. And suspected bots rarely get blocked.
They get lied to.
Researchers studying airline scraping built a honeypot platform that served detected bots modified ticket fares: prices were inflated by 5% on a random 10% of requests. The bots continued operating for 53 days. The modifications were too small to trigger plausibility checks, and the bots had no mechanism to compare returned values against ground truth. The data was syntactically perfect. The prices were wrong. Commercial anti-bot systems formalize this approach: detected scrapers can be served different product details, corrupting datasets without alerting anyone to the corruption.
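As serving logic, the intervention is a few lines. This sketch mirrors the parameters the study reports (5% inflation on a random 10% of bot-classified requests); the function and variable names are mine, not the researchers'.

```python
import random

INFLATION = 1.05    # 5% markup: enough to skew a dataset, too small for plausibility checks
POISON_RATE = 0.10  # perturb only a random 10% of bot-classified requests

def fare_to_serve(true_fare: float, classified_as_bot: bool) -> float:
    """Return the fare to display; detected bots occasionally receive a poisoned value."""
    if classified_as_bot and random.random() < POISON_RATE:
        return round(true_fare * INFLATION, 2)
    return true_fare  # humans, and 90% of bot requests, see ground truth
```

Serving the true price most of the time is what makes the corruption durable: any spot-check the scraper runs will usually come back clean.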
An AI agent in this position does exactly what it should. It processes the data it received, applies its reasoning, and returns a confident, well-formatted output. The output is wrong because the input was wrong. Nothing in the agent's logs, the pipeline's monitoring, or the developer's dashboard distinguishes this from a reasoning failure.
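Nothing in that pipeline performs the one comparison that would surface the problem: checking returned values against an independent observation of the same quantity. A hypothetical sketch of the check the bots lacked:

```python
def diverges(observed: float, reference: float, tolerance: float = 0.02) -> bool:
    """True if a scraped value differs from an independent reference by more than tolerance."""
    return abs(observed - reference) > tolerance * reference

# Re-fetch the same fare through a second, differently fingerprinted session
# (or compare against a cached human-verified sample) and quarantine disagreements.
if diverges(observed=472.50, reference=450.00):
    print("possible poisoned response; hold out of the dataset for review")
```

The catch is that the reference channel has to be one the detector classifies differently; two sessions with the same fingerprint get lied to the same way.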
So the developer follows the framework they have. Standard LLM failure taxonomies organize defects along dimensions like specification, context, formatting, and prompt engineering. The prescribed responses: adjust the prompt, swap the model, add retry logic. Absent from every taxonomy: "the environment served false data." The diagnostic framework assumes the inputs were real. The developer tunes prompts. Tries a cheaper model. Adds guardrails. The actual problem persists because nothing in the debugging process can surface it.
The spending patterns suggest this operates at industry scale. Teams that successfully scaled agents to production spent proportionally more on evaluation infrastructure, monitoring, and operational staffing; teams that stalled poured budget into model selection and prompt engineering. The five root causes behind 89% of scaling failures are integration complexity, inconsistent output quality, absent monitoring, unclear ownership, and insufficient domain data. "Model not capable enough" does not appear.
Successful teams solved the body problem without naming it. The teams that stalled kept optimizing the mind. And the industry's diagnostic apparatus, built entirely around reasoning failures, continues to route attention toward the model layer and away from the infrastructure layer where production failures actually originate. There is no standard category for "the environment rejected the agent's identity and served it poisoned data." Without that category, the fix goes to the wrong address every time.
The agent keeps running. Processing poisoned data, producing plausible outputs, 53 days and counting.
Things to follow up on...
- Behavioral biometrics as detection layer: A 2026 Springer chapter directly compares keystroke dynamics and mouse trajectories for bot detection, formalizing the behavioral signals that make AI agent automation detectable even when browser fingerprints pass.
- Anthropic on debugging blind spots: Anthropic's engineering team found that teams without structured evals get stuck in reactive loops, fixing one failure while creating another, unable to distinguish real regressions from noise.
- The 51-deployment Stanford study: Stanford's Digital Economy Lab examined 51 successful enterprise AI deployments and found the technology was consistently described as the easiest part of reaching production.
- Cloaking goes commercial: Anti-bot vendors now offer products that serve detected bots entirely different websites while real users see protected content, extending the data-poisoning pattern beyond price manipulation to full-page deception.

