When a new agent benchmark result lands, the shorthand is almost always about the model. Claude scores 64.9% on GAIA. GPT-5 hits 42% on Online Mind2Web. The number attaches to the model name like a test grade.
But look at what's inside that number. Claude Opus 4 scores 64.9% on GAIA inside one orchestration scaffold and 57.6% inside another. Same model, same benchmark, same questions. The seven-point gap comes entirely from the harness wrapping the model: how it manages context, sequences tool calls, handles retries. That's larger than the improvement between many consecutive frontier model releases. On CORE-Bench, the HAL leaderboard recently declared the benchmark solved after a researcher submitted an updated scaffold using Claude Code. Same model, different orchestration, and the benchmark tipped from unsolved to solved.
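What a scaffold actually does is easy to hand-wave past, so here is a minimal sketch of the variable in question. Everything in it is hypothetical: `call_model` and `run_tool` are stand-ins for whatever model API and tool executor a real harness uses. The point is only the shape of the difference between a bare harness and one that manages context and retries.

```python
# Hypothetical sketch of two scaffolds around the same model call.
# `call_model` and `run_tool` are stand-ins, not a real API.

def bare_scaffold(task, call_model, run_tool):
    """One shot: ask the model for an action, execute it, return the result."""
    action = call_model(prompt=task)
    return run_tool(action)

def managed_scaffold(task, call_model, run_tool, max_retries=3, context_limit=8000):
    """Same model, but the harness compacts context and retries failed tool calls."""
    history = [task]
    for attempt in range(max_retries):
        # Naive context management: keep only the most recent characters.
        context = "\n".join(history)[-context_limit:]
        action = call_model(prompt=context)
        try:
            return run_tool(action)
        except Exception as exc:
            # Fold the failure back into context so the next attempt can adapt.
            history.append(f"attempt {attempt} failed: {exc}")
    raise RuntimeError("all retries exhausted")
```

Both functions call the same model. On a benchmark, they would post different scores under the same model name.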
The token-budget variable is stranger still. On Online Mind2Web, one scaffold running Claude Sonnet 4 cost $1,577 for 40% accuracy while a different scaffold running GPT-5 Medium hit 42% for $171. Nine times the cost, two percentage points of accuracy, and the expensive configuration lost. You might assume that's because the cheaper system was more efficient, and more reasoning effort would close the gap. But HAL's evaluation found that increasing reasoning effort reduced accuracy in 21 of 36 runs. More thinking made things worse more often than it helped. The budget lever doesn't pull in a predictable direction.
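Normalizing by cost makes the comparison starker. A back-of-envelope sketch using only the two figures quoted above:

```python
# Back-of-envelope from the two Online Mind2Web figures above.
configs = {
    "Claude Sonnet 4 scaffold": {"cost_usd": 1577, "accuracy_pct": 40},
    "GPT-5 Medium scaffold": {"cost_usd": 171, "accuracy_pct": 42},
}
for name, c in configs.items():
    print(f"{name}: ~${c['cost_usd'] / c['accuracy_pct']:.0f} per accuracy point")
# Claude Sonnet 4 scaffold: ~$39 per accuracy point
# GPT-5 Medium scaffold: ~$4 per accuracy point
```

Roughly a ten-fold difference in dollars per point, and the cheaper configuration was ahead on raw accuracy too.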
What the benchmark number contains, then, is a product of model × scaffold × token budget, and two of those three variables are doing at least as much work as the model itself. Treating the score as a model grade means ignoring most of the information it carries.
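One way to internalize that is to treat the full configuration tuple, not the model name, as the key a score attaches to. A minimal sketch, with field names of my own invention rather than any leaderboard's schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkResult:
    """A score keyed to the configuration that produced it, not just the model.
    All field names here are illustrative, not HAL's schema."""
    benchmark: str
    model: str
    scaffold: str       # orchestration harness, ideally with a version
    token_budget: str   # e.g. a reasoning-effort setting
    runs: int           # how many trials the score averages over
    accuracy: float

# The two GAIA results above become two distinct records, not one model grade:
a = BenchmarkResult("GAIA", "Claude Opus 4", "scaffold A", "default", 1, 64.9)
b = BenchmarkResult("GAIA", "Claude Opus 4", "scaffold B", "default", 1, 57.6)
```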
And there's a further wrinkle. If these scores describe system configurations, how stable are they? The HAL team seems to have arrived at this concern independently. Every sub-page on their leaderboard now carries the same notice: they've paused updating with new models to focus on measuring reliability. Their companion paper, "Towards a Science of AI Agent Reliability," evaluated 14 models across 18 months of releases and found that while accuracy climbed steadily, reliability barely moved on open-ended tasks like GAIA. Counterintuitively, larger models sometimes showed less consistency than smaller ones, as though a richer behavioral repertoire meant more ways to vary between runs.
An agent powered by GPT-4o achieves over 60% task success on a single τ-bench run. Demand that each task succeed on all eight of eight repeated runs, τ-bench's pass^8 metric, and the rate drops below 25%.
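τ-bench's pass^k formalizes this: the probability that all k independent trials of a task succeed, averaged over tasks. With c successes in n recorded runs of a task, the standard unbiased per-task estimate is C(c, k)/C(n, k). A toy sketch of why a healthy single-run score collapses under that definition:

```python
from math import comb

def pass_hat_k(outcomes: list[list[bool]], k: int) -> float:
    """Estimate pass^k from a per-task matrix of run outcomes.
    Per task with c successes in n runs, the unbiased estimate is
    C(c, k) / C(n, k); comb() returns 0 whenever c < k."""
    per_task = [comb(sum(runs), k) / comb(len(runs), k) for runs in outcomes]
    return sum(per_task) / len(per_task)

# Toy data shaped like the τ-bench finding: 5 tasks always pass,
# 15 tasks pass only half the time.
solid = [[True] * 8] * 5
flaky = [[True, False] * 4] * 15
print(pass_hat_k(solid + flaky, 1))  # 0.625 -- "over 60%" on a single run
print(pass_hat_k(solid + flaky, 8))  # 0.25  -- only the always-solved tasks survive
```

Any task that fails even once in n runs contributes nothing at k = n, so pass^8 is effectively the fraction of tasks the agent solves every single time.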
What the leaderboard reports, in other words, is a system score: one that shifts depending on which scaffold wraps the model, which budget settings are applied, and which particular run you happen to be watching.
The scores are most informative when you read them as descriptions of specific system configurations under specific conditions. The scaffold choice, the token budget, the run-to-run variance: this is the most actionable information the benchmark produces, because it points to where engineering effort pays off. For most agent workloads right now, that's the orchestration layer. A practitioner looking at a leaderboard with that lens starts asking: what changed between the configurations that scored 57% and 65%, and can I build that difference into my system?
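That question becomes mechanical once results are stored as configuration records. Continuing the hypothetical `BenchmarkResult` sketch from earlier:

```python
def config_diff(a: BenchmarkResult, b: BenchmarkResult) -> dict:
    """Fields that differ between two results: the candidate explanations
    for the score gap."""
    return {
        field: (getattr(a, field), getattr(b, field))
        for field in a.__dataclass_fields__
        if getattr(a, field) != getattr(b, field)
    }

print(config_diff(a, b))
# {'scaffold': ('scaffold A', 'scaffold B'), 'accuracy': (64.9, 57.6)}
```

The diff is trivial here because the sketch is trivial; in practice the interesting fields are the scaffold version and the budget settings, the two variables most leaderboards still fold into the model's name.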
Things to follow up on...
- Evals as compute bottleneck: The HuggingFace EvalEval Coalition found that HAL spent roughly $40,000 on a single evaluation sweep across 21,730 agent rollouts, with individual GAIA runs costing nearly $3,000 before caching.
- SWE-bench's contamination problem: OpenAI confirmed that every frontier model shows training data contamination on SWE-bench Verified, leading them to stop reporting Verified scores in favor of the harder SWE-bench Pro, where the same Claude Opus 4.5 drops from 80.9% to 45.9%.
- Twelve metrics for reliability: The HAL team's companion paper proposes a concrete framework decomposing agent reliability into consistency, robustness, predictability, and safety, finding that capability gains over 18 months yielded only modest reliability improvement.
- Outcome scoring remains unsolved: Anthropic's engineering team argues that step-level tracing is the solved half of agent evaluation, while outcome scoring still requires domain experts to judge whether the agent actually accomplished the goal in context.

