Somewhere in every agent deployment there's a spreadsheet. It has a column for the human task being replaced, a column for the hourly cost, and a column for the projected savings. The spreadsheet is how the project gets funded. It is also, quietly, the reason the most significant costs of the project will never be visible to the people who approved it.
The spreadsheet runs on substitution logic: an agent performs a task a human used to perform, the human cost disappears. Simple, clean, fundable. But the logic contains an assumption so embedded it doesn't register as an assumption at all. It assumes that the work surrounding the task stays constant. That the task was a discrete, removable unit. That nothing new gets created in its place.
After deployment, the picture looks nothing like removal. A single agent step with 95% reliability sounds fine. Ten steps chained together: 59.9%. Twenty: 35.8%. That reliability gap has to be managed by someone. Prompt maintenance when model updates shift how instructions land. Drift detection when output quality degrades gradually, so that no individual run fails but the aggregate has already moved. Root cause triage across layers of possible failure: the model, the retrieval, the tool call, the prompt itself. Each of these is a recurring operational activity that did not exist before the agent was deployed and has no corresponding line in the business case that justified it.
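The compounding arithmetic above is easy to verify. A minimal sketch, assuming each step succeeds independently with the same probability (a simplification, since real agent steps share context and can fail in correlated ways):

```python
# Chained-step reliability: every step must succeed for the run to succeed.
# Assumes independent, equally reliable steps.

def chain_reliability(per_step: float, steps: int) -> float:
    """End-to-end success probability of `steps` independent steps."""
    return per_step ** steps

for n in (1, 10, 20):
    print(f"{n:2d} steps at 95% each -> {chain_reliability(0.95, n):.1%}")
# 10 steps -> 59.9%, 20 steps -> 35.8%
```

The independence assumption usually makes this an optimistic bound: correlated failures and retry loops in real pipelines can push effective reliability lower still.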
89% of organizations have implemented observability for their agent systems. Only 52% have implemented evaluation. Teams confirm the system ran. Far fewer confirm the system worked.
Those numbers come from LangChain's 2025 survey, and the rest of it fills in the shape. Teams watch the system run: they track latency, token counts, tool calls, error rates. Whether the outputs are correct gets measured far less. Google's Office of the CTO noted that every GenAI project rapidly becomes an evaluation project, yet evaluation is the capability teams adopt last and fund least. The space between watching a system run and knowing it worked is where most of the uncounted labor accumulates.
The RPA wave left a record worth reading here. HfS Research estimated that licensing was only 25–30% of total cost. The rest was maintenance, exception handling, governance. Thoughtworks, citing Forrester, warned that wide use of RPA would instantiate another layer of software to be managed. The industry drew a lesson about implementation discipline. The accounting framework that made these costs invisible went unexamined.
Substitution logic prices a bundle. A human doing a task includes, invisibly, the validation, the judgment calls, the ambient monitoring that keeps the task connected to reality. Automation fragments that bundle. The agent performs the task. The validation, monitoring, maintenance, and judgment scatter into new activities that get assigned to someone's calendar without appearing in anyone's budget. Accumulated, those activities constitute something that looks a lot like an internal platform nobody planned to build: evaluation scaffolding, drift monitoring, prompt versioning, failure triage workflows. The spreadsheet that funded the project has no row for work the project itself calls into being. And because the spreadsheet is the only instrument the organization has for seeing cost, the work doesn't get named, doesn't get funded, and when it eventually surfaces as a problem, gets attributed to something else entirely.
Things to follow up on...

- Drift as silent degradation: A CIO.com analysis describes how agentic systems rarely produce a single catastrophic error, instead drifting incrementally as models update, prompts are refined, and dependencies shift beneath teams that passed every review gate.
- Reliability drops at scale: An arXiv study evaluating state-of-the-art agents found that performance drops from 60% on a single run to 25% when measured across eight consecutive runs, exposing a consistency gap that single-run benchmarks structurally miss.
- Stack churn compounds maintenance: Cleanlab's 2025 survey found that 70% of regulated enterprises update their AI agent stack every three months or faster, turning infrastructure stabilization into a recurring cost that never appears in the original deployment plan.
- The RPA cost parallel: Thoughtworks warned in 2021, citing Forrester, that organizations found RPA solutions didn't meet expected ROI because bot maintenance costs were substantial, a pattern now repeating almost identically in AI agent post-deployment operations.
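The consistency gap in the second note can be made concrete with a pass^k-style calculation. The per-task success rates below are hypothetical, not the study's data; they only illustrate how a 60% single-run average can coexist with a roughly 25% all-of-eight-runs pass rate when some tasks succeed only intermittently:

```python
# Hypothetical per-task success probabilities for an agent benchmark:
# a quarter of tasks are solved reliably, the rest only intermittently.
task_success = [1.0] * 25 + [0.467] * 75

def pass_single(probs):
    """Average success rate over one run per task."""
    return sum(probs) / len(probs)

def pass_all_k(probs, k):
    """Fraction of tasks solved on all k independent runs (pass^k)."""
    return sum(p ** k for p in probs) / len(probs)

print(f"single run : {pass_single(task_success):.0%}")   # ~60%
print(f"8 of 8 runs: {pass_all_k(task_success, 8):.0%}") # ~25%
```

The point of the sketch is that averaging over single runs hides the intermittent tasks entirely; only a repeated-run metric exposes them.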

