
Foundations
Conceptual clarity earned from building at scale

What Reliability Actually Means in Adversarial Environments

We've seen web agent systems maintain 100% availability while delivering zero correct results. Every request completes successfully. Every HTTP response returns 200. The infrastructure is "up." But every extraction triggers detection and returns error pages instead of data. Traditional uptime metrics can't capture what's actually broken. When the web actively resists automation, reliability means something different than availability percentages—something most teams don't measure until after deployment.
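One way to make that gap visible is to track a correct-result rate alongside uptime. The sketch below is illustrative only, not any particular stack: it assumes a simple requests-based fetch and a hand-maintained list of block-page markers, both of which are hypothetical stand-ins for whatever your pipeline actually uses.

```python
# A minimal sketch of result-level health checking, assuming a hypothetical
# extraction pipeline. Marker strings and function names are illustrative.
import requests

# Phrases that commonly appear on interstitial or challenge pages; in practice
# this list would be tuned per target site.
BLOCK_PAGE_MARKERS = ("verify you are human", "access denied", "unusual traffic")

def fetch_and_validate(url: str) -> dict:
    """Return both transport status and a judgment on whether the payload is usable."""
    response = requests.get(url, timeout=30)
    body = response.text.lower()

    blocked = any(marker in body for marker in BLOCK_PAGE_MARKERS)
    return {
        "http_ok": response.status_code == 200,                      # what uptime dashboards see
        "content_ok": response.status_code == 200 and not blocked,   # what actually matters
    }

def correct_result_rate(results: list[dict]) -> float:
    """Reliability as the share of requests that returned usable data, not just 200s."""
    if not results:
        return 0.0
    return sum(r["content_ok"] for r in results) / len(results)
```

The point of the sketch is the second metric: a dashboard that only watches http_ok can stay green while correct_result_rate drops to zero.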

The Paradox of Web Transparency

View Source still works on every website. The HTML arrives readable, transparent, exactly as Tim Berners-Lee designed it in 1991. The web's founding principle of radical openness never disappeared. The architecture still assumes anyone can see how things work.
Yet building reliable automation on this transparent foundation now requires infrastructure most enterprises can't afford. The same openness that enabled the web's growth created conditions for something unexpected. The architecture remained universal. Who could act on it did not.

Rina Takahashi
Rina Takahashi, 37, former marketplace operations engineer turned enterprise AI writer. Built and maintained web-facing automations at scale for travel and e-commerce platforms. Now writes about reliable web agents, observability, and production-grade AI infrastructure at TinyFish.


Pattern Recognition from the Field
I keep seeing the same thing: companies run successful AI agent pilots, then production deployment stalls. Pilot adoption jumped from 37% to 65% in one quarter. Production deployment? Still stuck at 11%. MIT found only 5% of custom enterprise AI tools actually make it to production.
Look at what's breaking. Apple delayed Siri features. Amazon stripped down Alexa+. Salesforce's Einstein hits data silos. The pattern isn't about AI capability. Eighty percent of enterprises can't connect their systems. Legacy infrastructure wasn't built for autonomous agents.
The companies getting agents into production? They redesigned workflows first. McKinsey found they're twice as likely to see real ROI. This is an infrastructure problem that happens to involve AI, not an AI problem that needs infrastructure.
Connecting existing systems consumes 40% of IT resources before agents can operate, with 80% of enterprises struggling with basic connectivity.
LangChain dominates at 55.6% adoption while OpenAI powers 73.6% of projects, showing clear winners emerging despite vendor proliferation.
Organizations seeing significant ROI are twice as likely to have rebuilt end-to-end processes with explicit human handoff points (a pattern sketched below).
Very large models like GPT-4o deliver accuracy but become prohibitively expensive when running multi-agent systems at production scale.
Gartner estimates only 130 of thousands of vendors claiming "agentic AI" actually deliver autonomous capabilities versus rebranded automation.
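To make "explicit human handoff points" concrete, here is a minimal sketch of a workflow checkpoint that blocks an agent until a human decision is recorded. The class, step, and callback names are hypothetical and not taken from any framework; a real deployment would wire review_fn to an approval queue, ticket, or UI rather than an inline callback.

```python
# Illustrative only: a checkpoint that forces a human decision before an agent
# proceeds past a sensitive step. Names are hypothetical, not from any library.
from dataclasses import dataclass, field
from enum import Enum

class Decision(Enum):
    APPROVED = "approved"
    REJECTED = "rejected"
    PENDING = "pending"

@dataclass
class HandoffCheckpoint:
    step_name: str                 # the workflow step that requires sign-off
    payload: dict = field(default_factory=dict)   # what the agent wants to do next
    decision: Decision = Decision.PENDING

def run_step_with_handoff(checkpoint: HandoffCheckpoint, review_fn) -> bool:
    """Pause the workflow until a human (via review_fn) approves or rejects the step."""
    checkpoint.decision = review_fn(checkpoint)    # e.g. a ticket, a chat approval, a UI
    return checkpoint.decision is Decision.APPROVED

# Usage: the reviewer callback here auto-approves; in production it would block
# on an actual human response.
if __name__ == "__main__":
    cp = HandoffCheckpoint("submit_refund", {"order_id": "A-1023", "amount": 42.00})
    approved = run_step_with_handoff(cp, lambda c: Decision.APPROVED)
    print("proceed" if approved else "halt")
```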
Questions Worth Asking
Most evaluation criteria focus on what works in controlled conditions. Feature completeness. Benchmark performance. API elegance. Then production happens.
The expensive mistakes come from asking the wrong questions early. Teams evaluate capabilities instead of operational burden. They test performance with clean data instead of the messy reality their systems will face. They assume debuggability instead of verifying it.
After enough deployments, you learn which questions actually predict success. They're not the ones in vendor comparison matrices. They're the ones that reveal what happens when demos meet real traffic, when requirements change, when something breaks at 3am and you need answers fast.
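One way to verify debuggability rather than assume it: ask whether every agent step leaves a structured, correlated trace you can query after a failure. The sketch below is a minimal illustration using standard-library logging; the event fields and step names are hypothetical, not any vendor's tracing API.

```python
# A minimal sketch of per-step trace events for an agent run, so a 3am failure
# can be reconstructed from logs. Field and step names are illustrative.
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("agent.trace")

def log_step(run_id: str, step: str, **fields) -> None:
    """Emit one structured event per agent step, correlated by run_id."""
    event = {"run_id": run_id, "step": step, "ts": time.time(), **fields}
    log.info(json.dumps(event))

# Usage: every run gets an id; every step records what it saw and how it ended.
run_id = str(uuid.uuid4())
log_step(run_id, "fetch_page", url="https://example.com/listing", status="ok")
log_step(run_id, "extract_fields", fields_found=7, fields_expected=9, status="partial")
log_step(run_id, "validate_output", status="failed", reason="missing price")
```

If a candidate system cannot produce something equivalent to this trail out of the box, the 3am question has already been answered.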
