CURRENT | Foundations

Field Guide

What You're Measuring and How You're Measuring It

By Rina Takahashi, February 3, 2026

Your agent passes testing, then fails in production. Another completes its tasks by taking fragile paths you can't see. A third does exactly what it should but scores zero on task completion. The evaluation says "working," but working how? Teams get confusing results not because agents are unpredictable, but because what they measure and how they measure it answer entirely different questions.

Tools & Techniques

Tools in Context

When Restarting From Scratch Costs Less Than Saving Progress

The alert fires at 2am. The workflow crashed at step four of seven. Should those three successful steps have been saved somewhere? For teams running thousands of high-frequency workflows daily, the answer is surprising: restart from scratch. The coordination overhead (state management across workers, cleanup routines, synchronization delays) costs more than re-executing from the beginning. At certain scales and workflow durations, accepting occasional failures beats implementing elaborate persistence. Production math reveals when operational simplicity wins.
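The trade-off in this teaser can be made concrete with a back-of-the-envelope model. The sketch below is a simplification under stated assumptions, not the article's own math: every step takes the same time, a crash is equally likely at any step, a retry always succeeds, and checkpointing adds a fixed overhead per step. All function names and input values are hypothetical.

```python
def expected_restart_seconds(steps, step_seconds, crash_probability):
    """Expected run time when a crash means re-executing from step one."""
    # A crash at step k (uniform over 1..steps) wastes the k - 1 finished
    # steps, so the average waste is (steps - 1) / 2 steps.
    wasted = crash_probability * (steps - 1) / 2 * step_seconds
    return steps * step_seconds + wasted


def expected_checkpoint_seconds(steps, step_seconds, crash_probability,
                                checkpoint_seconds, resume_seconds):
    """Expected run time when state is saved after every step."""
    # Every run pays the checkpoint overhead, crash or no crash.
    base = steps * (step_seconds + checkpoint_seconds)
    # A crash loses no finished steps but pays a one-time resume cost.
    return base + crash_probability * resume_seconds


def cheaper_strategy(steps, step_seconds, crash_probability,
                     checkpoint_seconds, resume_seconds):
    """Return 'restart' or 'checkpoint', whichever has lower expected time."""
    restart = expected_restart_seconds(steps, step_seconds, crash_probability)
    checkpoint = expected_checkpoint_seconds(
        steps, step_seconds, crash_probability,
        checkpoint_seconds, resume_seconds,
    )
    return "restart" if restart <= checkpoint else "checkpoint"


# Short, frequent workflow: 7 steps of 2 s each, 1% crash rate (assumed).
short_run = cheaper_strategy(7, 2.0, 0.01, 0.5, 1.0)

# Long workflow: 7 steps of 30 min each, 20% crash rate (assumed).
long_run = cheaper_strategy(7, 1800.0, 0.20, 5.0, 60.0)
```

With these assumed inputs the short workflow favors restarting and the long one favors checkpointing, because the per-step overhead is paid on every run while the wasted work is paid only on the runs that crash.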

Tools in Context

What It Actually Takes to Resume a Crashed Workflow

A compliance workflow authenticates, scrapes forty pages of transaction history, then waits three hours for manual legal review before continuing. You can't keep a browser running that long. Sessions expire, memory leaks accumulate, containers get recycled on schedule. The workflow must terminate and resume hours later with state intact. For teams running long workflows across distributed fleets, checkpoints stop being optional. But state persistence brings specific overhead: synchronization delays, cleanup routines, session tracking across workers. Here's what coordination actually requires.

In Dialogue With Complexity

An Interview with Non-Deterministic Output About Refusing to Be the Same Twice

Pattern Recognition

Identity Vendors Rush Agent IAM Products in January

Four identity vendors shipped agent-specific IAM within three weeks. Microsoft's Entra Agent ID launched January 20. The Cloud Security Alliance released its MAESTRO framework January 15. Qualys shipped Agent Grant January 6. Exabeam rolled out Agent Behavior Analytics earlier in the month.

Traditional IAM expects identities to last months or years. Agents live seconds or minutes. Legacy systems track human credentials. Agents operate through delegation chains nothing was built to monitor.

Security audits keep finding thousands of ungoverned agent identities already running. Some organizations hit 17 agents per employee. Vendors watched the same pattern repeat: customers discovering agent sprawl they never authorized, never tracked, couldn't govern with existing tools.

Lifecycle mismatch:

Agents need just-in-time provisioning and automatic expiration after tasks complete, contradicting IAM practices built for persistent accounts over decades.

Permission escalation:

Organizational agents frequently run with broader access than individual users, creating unintended paths when user context disappears mid-workflow.

Discovery shock:

Security scans reveal thousands of unknown agent identities already operating. Vendors describe these discoveries as customers' "jaw-dropping moments" during audits.

Adoption pressure:

Gartner projects 40% of enterprise applications integrating with agents by late 2026, up from under 5% in 2025.

Governance lag:

57% of surveyed organizations already run production agents, but 80% deployed without proper governance frameworks, creating compliance problems that have to be fixed retroactively.
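The lifecycle mismatch item above calls for just-in-time provisioning with automatic expiration. A minimal sketch of that idea follows, assuming a credential is simply a random token with an expiry timestamp. `AgentCredential` and `issue_credential` are hypothetical names for illustration and do not correspond to any vendor's product or API.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class AgentCredential:
    """A short-lived credential bound to one agent (hypothetical type)."""
    agent_id: str
    token: str
    expires_at: datetime

    def is_valid(self, now: datetime) -> bool:
        # The credential expires on its own; nobody has to revoke it.
        return now < self.expires_at


def issue_credential(agent_id: str, ttl: timedelta,
                     now: datetime) -> AgentCredential:
    """Issue a credential just in time, valid only for the task's duration."""
    return AgentCredential(
        agent_id=agent_id,
        token=secrets.token_urlsafe(32),
        expires_at=now + ttl,
    )


# Assumed values: a five-minute task starting at a fixed UTC timestamp.
start = datetime(2026, 1, 1, tzinfo=timezone.utc)
cred = issue_credential("agent-1", timedelta(minutes=5), start)
still_valid = cred.is_valid(start + timedelta(minutes=4))
expired = not cred.is_valid(start + timedelta(minutes=5))
```

Passing `now` explicitly keeps the expiry check testable. A real deployment would also need to record issuance so that short-lived identities show up in audits.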

Questions Worth Asking

Before you deploy anything to production, you need better questions than "does it work?" Most systems fail in the gap between functional and production-ready. Not from broken code. From teams that never asked what happens when things break, who pays for it, or who fixes it at 3 AM.

These questions come from operating systems under real load. Each one surfaces a factor that gets overlooked in demos but becomes expensive after launch. The right question at the right time can save you from mistakes that only become obvious when you're already live.

Contractual Consequences

What Happens When This Breaks?

If you miss targets, does someone write a check or just send an apologetic email? Real SLAs trigger rebates and contract penalties. Everything else is an SLO you're tracking internally. That difference determines whether you architect for guarantees or aspirations.
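Whether the target is a contractual SLA or an internal SLO, it translates into a concrete downtime budget. The sketch below shows that arithmetic. The function name, the 99.9% target, and the 30-day window are assumptions chosen for illustration, not figures from this guide.

```python
def allowed_downtime_minutes(availability_target: float,
                             window_days: int = 30) -> float:
    """Minutes of downtime a target permits over the measurement window."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    # Whatever fraction of the window is not promised as "up" is the budget.
    return total_minutes * (1.0 - availability_target)


# Assumed example: a 99.9% target over 30 days allows about 43.2 minutes.
budget = allowed_downtime_minutes(0.999)
```

The number is the same either way. What differs is the consequence of exceeding it: a rebate or penalty under an SLA, an internal review under an SLO.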

Cost Visibility

Can You Explain Next Month's Bill?

When cloud costs spiral, can you trace spending to specific workloads? Production systems need cost monitoring that spots optimization opportunities before budget meetings turn awkward. Guessing at invoices means you're not ready to scale. Period.

Recovery Planning

Who Fixes This at 3 AM?

Incident response plans define who does what when systems fail. Without clear ownership and coordination tools, disruptions become disasters. The gap between a hiccup and a crisis? Knowing exactly who responds, how they coordinate, what recovery actually looks like.

Infrastructure Codification

Is Your Infrastructure Clicked or Coded?

Provisioning through cloud consoles puts you at the lowest maturity level. Elite teams manage production infrastructure through version-controlled IaC with automated validation. Manual provisioning creates drift you can't track. It doesn't scale.

Observability Implementation

Can You See What's Actually Happening?

Among teams running AI agents, 89% have observability in place while only 52% have evaluation frameworks. That gap matters. You need visibility into behavior before you can measure quality. Observability shows you what your system does under real conditions, not what you hoped it would do.

Cultural Indicators

After Incidents, Do You Hunt Blame?

The most sophisticated toolchain fails if teams operate in silos. Elite organizations conduct blameless post-mortems focused on systemic root causes, not guilty individuals. Culture determines whether your technical capabilities translate to actual performance. Tools alone never do.