Foundations
The scaffold around a model shapes agent performance more than most model releases, which changes how to read benchmarks, pick frameworks, and evaluate what's actually working.

What a Benchmark Score Actually Contains

Claude Opus 4 scores 64.9% on GAIA in one scaffold and 57.6% in another. Same model, same benchmark, same questions. The seven-point gap comes entirely from the orchestration wrapping the model, and it's larger than the gain many consecutive frontier releases produce. An agent benchmark score carries information about scaffold and token budget alongside model capability, often in roughly equal measure. Read all three variables and the numbers tell you where the engineering leverage lives.

Every Agent Framework Is an Argument About What's Hard

Framework comparison tables show you what's available. The interesting part starts when a container restarts mid-workflow at 3 a.m. LangGraph, CrewAI, the Claude Agent SDK, and Microsoft Agent Framework 1.0 each embed a different argument about what the hard problem in agent orchestration actually is: control flow, task delegation, permission boundaries, or enterprise interoperability. Picking one means committing to a control philosophy that shapes how your system encounters problems you haven't anticipated yet. A better approach: name your dominant constraint first, then choose the framework whose core abstraction matches it.

Past Articles

A container takes hundreds of milliseconds to start and hundreds of megabytes to hold. For a web service that runs for w...

Every major agent tracing framework records four identity attributes: description, ID, name, version. None of them inclu...

In a single week this April, Google, AWS, Cloudflare, and CIS independently shipped agent infrastructure built around th...

OpenClaw's April 9 "Dreaming" update shipped a UI called the Diary Timeline. Browse it and you'll find daily notes sitti...
