Practitioner's Corner
Each step works fine. The workflow fails anyway. The compound reliability math that quietly determines which agent deployments survive and which never could.

Reading an Architecture as an Argument About Failure

During their January 2026 Launch Week, Skyvern shipped a feature that lets users upload a PDF of a human standard operating procedure and have an AI agent build a workflow from it. A procedure manual written for people, fed directly to a machine. It's a revealing choice, and what it compensates for runs through the rest of the architecture.
That compensation points at something specific. Skyvern's architecture, from its three-phase task pipeline to where it runs its benchmarks, encodes a particular understanding of where web agent workflows actually break. Three design decisions, three different failure modes. The thread connecting them is worth following.
The Math That Quietly Decides Which Agent Workflows Survive

A single step succeeds 95% of the time. Chain twenty of those steps together and the workflow completes 36% of the time. No individual component failed. Multiplication just compounded what looked, at each decision point, like a rounding error.
The industry's benchmarks and product roadmaps mostly optimize for what happens inside each step. Better reasoning, longer context. Meanwhile, organizations actually shipping agents into production are doing something quieter: shortening their workflows until the arithmetic stops punishing them. The distance between what gets designed on a whiteboard and what gets deployed says a lot about where automation economics actually bind.
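The compounding arithmetic above is simple enough to verify directly. A minimal sketch, assuming independent steps with a uniform per-step success rate (the function name and example values are illustrative, not from any particular product):

```python
def workflow_success(step_reliability: float, n_steps: int) -> float:
    """End-to-end completion rate of a chain of independent steps."""
    return step_reliability ** n_steps

# 95% per step looks like a rounding error in isolation...
print(f"{workflow_success(0.95, 1):.0%}")   # 95%
# ...but twenty chained steps complete only ~36% of the time.
print(f"{workflow_success(0.95, 20):.0%}")  # 36%
# Shortening the workflow is the quiet fix: five steps survive ~77%.
print(f"{workflow_success(0.95, 5):.0%}")   # 77%
```

This also shows why per-step gains pay off nonlinearly: raising each step from 95% to 99% lifts the twenty-step chain from roughly 36% to roughly 82%.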
The Measurement Gap
Among organizations running agents in production, 89% have observability in place. Only 52% run evaluations of any kind. That 37-point spread is a confidence gap masquerading as coverage.
Observability tracks what happened: latency, token counts, tool calls, error rates. It confirms execution. Every step glows green. But when an agent corrupts its own context at turn six and compounds that corruption through turn eleven, step-level tracing registers nothing wrong. The reasoning broke, not the infrastructure.
Significant capital is flowing toward closing this gap. Whether the instruments being built can measure what actually matters is a harder question than the investment suggests.
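The gap between the two views can be made concrete. A hypothetical sketch (the trace fields, turn numbers, and answers are invented for illustration): every step in the trace reports success, so step-level observability sees nothing wrong, while an outcome-level evaluation against ground truth catches the corrupted result.

```python
# Hypothetical trace: every turn executes cleanly, but the agent's
# working context is corrupted at turn 6 and the error compounds.
trace = [{"turn": t, "status": "ok", "latency_ms": 120 + t} for t in range(1, 12)]
final_answer = "Q3 revenue: $4.1M"  # ground truth is $1.4M

# Observability view: confirms execution. Every step glows green.
all_steps_green = all(step["status"] == "ok" for step in trace)

# Evaluation view: judges the outcome, not the execution.
def evaluate(answer: str, expected: str) -> bool:
    return answer == expected

print(all_steps_green)                                  # True
print(evaluate(final_answer, "Q3 revenue: $1.4M"))      # False
```

The 89% with observability would see the first line; only the 52% running evaluations would see the second.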

