
Foundations
Conceptual clarity earned from building at scale

When to Automate and When to Keep Humans in the Loop

Teams come to us after watching agent demos, asking: "Can we automate our entire competitive monitoring process?" Sure, technically we can. That's exactly when things fall apart. Not because the agents stop working; they work fine. Trust collapses. Pricing data gets extracted incorrectly after a site redesign, and suddenly the whole system feels unreliable. Operating web agents across thousands of sites taught us this: technical capability isn't the constraint. Knowing where to draw the automation boundary is.
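
A pattern that has held up for us: automate the extraction, gate the publishing. Below is a minimal sketch of that boundary in Python; the thresholds, field names, and the route() helper are illustrative assumptions, not a real client pipeline.

```python
from dataclasses import dataclass

@dataclass
class PriceObservation:
    sku: str
    price: float | None          # None when extraction produced nothing usable
    previous_price: float | None

def route(observation: PriceObservation, max_jump: float = 0.5) -> str:
    """Decide whether an extracted price flows downstream automatically
    or waits for a human check. Thresholds are illustrative."""
    if observation.price is None or observation.price <= 0:
        return "human_review"    # structural failure: nothing usable extracted
    if observation.previous_price:
        change = abs(observation.price - observation.previous_price) / observation.previous_price
        if change > max_jump:
            return "human_review"  # suspicious jump: a redesign may have shifted fields
    return "auto_publish"        # boring, expected data flows through untouched

# A 60% overnight "price drop" after a site redesign gets held back for review.
print(route(PriceObservation(sku="A-100", price=19.99, previous_price=49.99)))  # human_review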

Tools & Techniques

When Scrapers Break Clean
Your scraper stops returning prices overnight. Every field comes back null, and your monitoring catches it immediately because the data structure broke. This is the kind of failure teams can work with—loud, obvious, fixable. Rule-based validation exists for these moments, stopping bad data before it flows downstream. At scale, catching structural breaks isn't optional. It's survival.
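
The rule-based layer for these loud breaks can stay small. A sketch in Python, assuming records shaped like the dictionary below; the field names and the validate_record() helper are hypothetical.

```python
REQUIRED_FIELDS = {"sku": str, "price": float, "currency": str}

def validate_record(record: dict) -> list[str]:
    """Return a list of structural problems; an empty list means the record passes."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field} is missing or null")
        elif not isinstance(value, expected_type):
            problems.append(f"{field} is {type(value).__name__}, expected {expected_type.__name__}")
    price = record.get("price")
    if isinstance(price, float) and price <= 0:
        problems.append("price is not positive")
    return problems

# The overnight "every field comes back null" failure is loud and easy to catch:
print(validate_record({"sku": None, "price": None, "currency": None}))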

When Data Drifts Quietly
Your scraper runs perfectly for months. Every field validates, types match, formats check out. Then someone in analytics notices the "product weight" field now contains shipping estimates. The data structure is fine. The meaning has drifted. This is what rule-based validation misses—the quiet corruption where everything looks right but nothing means what it should anymore.
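
Catching the quiet kind takes a different instrument: instead of checking each record's shape, compare a field's current values against its own history. A rough sketch below; the drift_score() helper, the threshold, and the sample numbers are invented for illustration.

```python
import statistics

def drift_score(baseline: list[float], current: list[float]) -> float:
    """Crude drift signal: distance between the current batch mean and the
    baseline mean, measured in baseline standard deviations."""
    baseline_mean = statistics.mean(baseline)
    baseline_stdev = statistics.stdev(baseline) or 1e-9  # avoid division by zero
    return abs(statistics.mean(current) - baseline_mean) / baseline_stdev

# "Product weight" in kilograms for months...
baseline_weights = [0.4, 0.55, 0.6, 0.38, 0.72, 0.5, 0.45]
# ...then the field quietly starts carrying shipping estimates in days.
current_values = [3.0, 5.0, 2.0, 7.0, 4.0, 3.0, 6.0]

if drift_score(baseline_weights, current_values) > 3.0:  # threshold is a guess, tune per field
    print("Flag for review: every record validates, but the distribution has moved.")
```

A check like this will not tell you what the field now means, only that it no longer behaves like its past self, which is usually enough to get a human looking.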

A Conversation with the Algorithm That Decides If You're Human Enough

Pattern Recognition
Watch what happens when procurement departments try to evaluate AI agents. They pull out the same RFP templates they use for CRM systems. Uptime guarantees. Fixed feature lists. Predictable outputs.
Agents don't work that way. They adapt. They learn. They behave differently across contexts.
Eighteen percent of organizations say they're not adopting agents because of "unclear use cases." That's code for something else. Healthcare systems report they don't even have frameworks for AI procurement yet. The buying process itself has become the bottleneck.
Traditional software behaves the same way every time. You can test it, measure it, lock down the requirements. Agents improve through use. Standard procurement can't evaluate that.
Procurement cycles take months while AI capabilities evolve weekly, forcing impossible trade-offs between thoroughness and relevance.
Leaders cite security concerns publicly, while the real barriers, organizational change and integration complexity, rank lowest in their stated concerns.
Research shows most AI studies skip comprehensive procurement frameworks, leaving hospitals without systematic evaluation approaches.
Effective evaluation requires tracking agent performance patterns across scenarios over time, not snapshot demonstrations; a sketch of that approach appears below.
Deterministic software checklists can't assess systems designed to adapt, creating structural evaluation blind spots.
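
To make the longitudinal point concrete, here is a minimal sketch of what tracking performance across scenarios over time can look like; the scenario names, dates, and scores are placeholders, not real evaluation data.

```python
from collections import defaultdict
from datetime import date

# One row per scored evaluation run: (scenario, run date, score in [0, 1]).
# In practice these come from repeated runs against held-out scenarios, not one demo.
runs = [
    ("prior-auth-summarization", date(2024, 5, 1), 0.62),
    ("prior-auth-summarization", date(2024, 6, 1), 0.71),
    ("prior-auth-summarization", date(2024, 7, 1), 0.78),
    ("referral-triage",          date(2024, 5, 1), 0.80),
    ("referral-triage",          date(2024, 6, 1), 0.74),
    ("referral-triage",          date(2024, 7, 1), 0.69),
]

by_scenario = defaultdict(list)
for scenario, run_date, score in runs:
    by_scenario[scenario].append((run_date, score))

for scenario, history in by_scenario.items():
    history.sort()  # chronological order
    first, last = history[0][1], history[-1][1]
    trend = "improving" if last >= first else "degrading"
    print(f"{scenario}: {first:.2f} -> {last:.2f} ({trend} over {len(history)} runs)")
```

A snapshot demo would report one number per scenario; the trend view is what separates an agent that is learning from one that is quietly getting worse.
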
Questions Worth Asking
The questions you ask when evaluating production systems reveal what you've learned the hard way. Experienced builders skip "can this work?" and ask questions that expose what breaks at scale, what's expensive to fix later, and what marketing materials conveniently omit.
These questions predict whether your POC becomes production infrastructure or expensive homework. Whether your database choice lets you grow or forces a rebuild. Whether your monitoring explains what's wrong or just signals that something is wrong.
The right questions cut through demos and documentation to what actually matters when systems face real users, real data, and real consequences.
