Foundations
Conceptual clarity earned from building at scale

The Observation Gap in Agent Delegation

Your competitor pricing agent just navigated authentication flows across two thousand retail sites, handled bot detection that varies by region, interpreted page structures that shift under A/B tests, and distinguished genuine price changes from temporary glitches. It made hundreds of judgment calls about when to retry, when to escalate, and whether anomalies matter.
You were in meetings the entire time. Now you're looking at the output—clean data, confidence scores, flagged uncertainties. The agent operated beyond your observation, and you're deciding whether to trust it. Most organizations treat this like learning new software. It isn't.
Tools & Techniques

What High-Frequency Monitoring Actually Catches
Check a website once and you learn whether it's working. Check it every minute for weeks and you learn how it behaves. High-frequency monitoring trades depth for speed—tests stay simple because they need to complete fast. But that speed reveals something synthetic checks can't: the patterns that emerge when you're watching constantly. Bot detection rolling out overnight. Site structures shifting during inventory updates. The things that break in production, as they're breaking.
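The core of this idea is that the signal lives in the rolling trend, not in any single check. A minimal sketch of that trend logic, assuming a cheap probe callable run on a schedule (the names FrequentMonitor, probe, and the thresholds are illustrative, not from any specific tool):

```python
from collections import deque

class FrequentMonitor:
    """Run a cheap pass/fail probe repeatedly and track a rolling pass
    rate, so behavioral shifts show up as trends rather than one-off
    failures."""

    def __init__(self, probe, window=60):
        self.probe = probe                  # callable returning True/False
        self.results = deque(maxlen=window) # only the recent window matters

    def check(self):
        """Execute one probe; any exception counts as a failure."""
        try:
            ok = bool(self.probe())
        except Exception:
            ok = False
        self.results.append(ok)
        return ok

    def pass_rate(self):
        if not self.results:
            return 1.0
        return sum(self.results) / len(self.results)

    def degraded(self, threshold=0.9):
        # A single failed check is noise; a full window with a falling
        # pass rate is the kind of pattern constant watching reveals.
        return (len(self.results) == self.results.maxlen
                and self.pass_rate() < threshold)
```

In practice the probe would be something like a one-minute HTTP fetch or login attempt driven by a scheduler; keeping it a plain callable here keeps the trend logic itself simple and testable.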

When Shadow Testing Reveals What Monitoring Misses
Shadow testing runs your new code against real production traffic for days or weeks, processing every edge case your infrastructure encounters while users see results from your current system. It's expensive—you're running everything twice. But some things about web agent reliability only become visible when you're handling actual authentication challenges, actual regional variations, actual bot detection patterns. The complexity that monitoring catches quickly, shadow testing catches thoroughly.
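The shape of a shadow path can be sketched in a few lines: serve the user from the current system, run the candidate on the same input, and record divergences instead of surfacing them. This is an illustrative skeleton, not a production implementation (in practice the candidate would run asynchronously so it adds no user-facing latency; the names here are hypothetical):

```python
def shadow_call(request, current, candidate, record_divergence):
    """Serve from the current implementation; run the candidate on the
    same input and record any divergence. Users only ever see the
    current system's result."""
    result = current(request)
    try:
        shadow_result = candidate(request)
        if shadow_result != result:
            record_divergence(request, result, shadow_result)
    except Exception as exc:
        # A crashing candidate is a finding, not an outage.
        record_divergence(request, result, repr(exc))
    return result
```

Because the candidate sees every real authentication challenge and regional quirk the current system sees, the divergence log becomes the thorough catalogue that synthetic tests can't produce.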

Pattern Recognition
Enterprises plan massive GPU expansion in 2025; ninety-six percent intend to add capacity. Yet only 7% achieve above 85% utilization during peak periods. Fifteen percent report that fewer than half their GPUs are doing real work, even when demand is highest.
The gap between procurement and utilization keeps widening. Companies spent $37 billion on generative AI in 2025, up 3.2x year-over-year. Meanwhile, 74% remain dissatisfied with their job scheduling tools. The top cloud compute concern isn't availability. It's wastage and idle costs.
Watch what organizations do, not what they say. They're treating GPU scarcity as a buying problem when the real constraint is orchestration. More hardware won't fix broken resource allocation.
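The arithmetic behind that claim is simple: at low utilization, each productive GPU-hour costs far more than the sticker price, and buying more hardware doesn't change the ratio. A back-of-envelope sketch, using purely illustrative numbers (the $2/hour rate and 40% utilization below are assumptions, not figures from the data above):

```python
def effective_cost_per_used_hour(hourly_rate, utilization):
    """Idle hours are still billed, so the real cost of a productive
    GPU-hour is the sticker rate divided by utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_rate / utilization

# Illustrative: a $2/hr GPU at 40% utilization costs $5 per hour of
# actual work. Doubling the fleet at the same utilization leaves the
# effective rate unchanged; better orchestration is what moves it.
```

This is why orchestration, not procurement, is the binding constraint: improving utilization lowers the effective rate on every GPU already owned.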

