Teams evaluate their agents and get confusing results. An agent that works perfectly in testing fails in production. Another succeeds at tasks but takes inefficient paths to get there. A third completes its designated role but scores zero on task completion metrics. The evaluation said "working" but didn't capture what kind of working mattered.
The confusion often comes from conflating measurement targets with measurement methods—what you're evaluating versus how you're running the evaluation.
What You're Measuring
Behavior evaluation measures outcomes. Did the agent complete the task? Did it produce correct output? WebArena exemplifies this approach: 812 tasks across e-commerce, forums, and content management, evaluating only whether agents reached the desired end state.
Capabilities evaluation examines how agents work: tool selection quality, planning steps, reasoning patterns, memory retention. An agent might complete tasks successfully while using fragile tool sequences or inefficient reasoning paths. Google's Agent Development Kit evaluates "the steps an agent takes to reach a solution, including its choice of tools, strategies, and the efficiency of its approach."
Reliability evaluation asks whether agents behave consistently. Same task, same way, multiple times? Performance maintained with varied inputs? Since language models are non-deterministic, agents exhibit natural variability. Reliability measures whether that variability stays within acceptable bounds.
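As a rough sketch, a reliability check can be as simple as running the same task many times and summarizing how much the outcomes spread. The `run_agent` callable below is hypothetical, standing in for whatever harness actually drives the agent; the consistency statistics are generic rather than tied to any framework.

```python
from collections import Counter

def reliability_check(run_agent, task, n_runs=10):
    """Run one task repeatedly and summarize how consistent the outcomes are.

    run_agent: hypothetical callable returning a (success: bool, answer: str)
    pair for a single run; swap in your own harness.
    """
    results = [run_agent(task) for _ in range(n_runs)]
    successes = [ok for ok, _ in results]
    answers = Counter(answer for _, answer in results)
    modal_count = answers.most_common(1)[0][1]
    return {
        "success_rate": sum(successes) / n_runs,   # how often it works at all
        "agreement_rate": modal_count / n_runs,    # how often runs land on the same answer
        "distinct_answers": len(answers),          # spread of final outputs
    }
```

Whether a 0.7 agreement rate is acceptable depends on the task; the point is to put a number on the variability rather than assume it away.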
These aspects operate independently. A coding agent might generate correct solutions but succeed only 30% of the time on first attempts. A support agent might gather information accurately while using tools inefficiently.
How You're Measuring
The measurement approach shapes what you can see.
Static evaluation uses fixed datasets in controlled environments—predetermined inputs checked against known answers. Fast, cheap, repeatable. Good for regression testing.
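A minimal sketch of that setup, assuming a hypothetical `agent(prompt)` callable and an exact-match check; real harnesses usually use fuzzier scoring, but the shape is the same: fixed inputs, known answers, one repeatable number.

```python
def static_eval(agent, dataset):
    """Score an agent against a fixed dataset of (input, expected) pairs.

    agent: hypothetical callable mapping an input string to an output string.
    dataset: list of {"input": ..., "expected": ...} records with known answers.
    """
    passed = 0
    for example in dataset:
        output = agent(example["input"])
        # Exact match for simplicity; swap in a task-specific checker as needed.
        if output.strip() == example["expected"].strip():
            passed += 1
    return passed / len(dataset)
```

Because nothing in the dataset changes between runs, the same score can be recomputed on every commit, which is what makes this form useful for regression testing.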
Interactive evaluation engages agents with dynamic environments, where they handle multi-turn dialogue and ask clarification questions. The DETOUR benchmark uses dual-agent conversations where agents navigate ambiguity through questions rather than predetermined paths.
Anthropic's coding agents demonstrate how the evaluation process affects results. Run them once per problem and they succeed 30% of the time; let them try multiple approaches and that jumps to 80%. Each setup captures a different aspect of capability.
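The arithmetic behind that jump is worth making explicit. Treating attempts as independent (a simplification; real attempts by the same agent are correlated), the chance that at least one of k attempts succeeds is 1 - (1 - 0.3)^k, which crosses 80% around five attempts:

```python
p_single = 0.30  # observed single-attempt success rate
for k in (1, 2, 3, 5, 8):
    # Probability that at least one of k independent attempts succeeds.
    p_any = 1 - (1 - p_single) ** k
    print(f"k={k}: {p_any:.0%}")
# k=1: 30%, k=2: 51%, k=3: 66%, k=5: 83%, k=8: 94%
```

How quickly a real agent closes that gap, compared with the independence model, says something about how correlated its failures are, which is itself useful signal.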
Dataset choices create specific constraints. Synthetic datasets enable fast testing with clear ground truth. Real-world datasets like Mind2Web-live couple live browser demonstrations with actionable targets, but require scheduled revalidation every 4-8 weeks as selectors break and URLs change. The dataset type determines whether you're measuring performance against idealized scenarios or operational conditions.
Misaligned Objectives
A support workflow splits across two agents. The first gathers account details and verifies identity. The second processes the actual refund. Evaluate the first agent for task completion and it scores zero—it never issued money. But it did its job perfectly. The evaluation measures the wrong objective for that agent's role.
Reliability evaluation through pass@k metrics measures the probability that at least one of k attempts succeeds. More attempts mean higher scores, but running multiple trials per input gets expensive fast. The evaluation process—how many trials you can afford—constrains which reliability questions you can answer.
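One common way to estimate pass@k, popularized by code-generation benchmarks, is to sample n attempts per input, count the c that succeed, and compute 1 - C(n-c, k) / C(n, k): the probability that a random size-k subset of those attempts contains at least one success. A small sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n sampled attempts with c successes."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 attempts on one input, 6 of them succeeded.
print(pass_at_k(n=20, c=6, k=1))  # ~0.30
print(pass_at_k(n=20, c=6, k=5))  # ~0.87
```

The cost constraint shows up directly in n: a tighter budget means fewer attempts per input and noisier estimates for larger k.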
Google's Agent Development Kit compares agent trajectories against expected step sequences, revealing errors in the process even when final outcomes are correct. Trajectory analysis shows tool selection, reasoning steps, and error recovery patterns—the agent's process beyond success rates.
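This is not the ADK API, but the core of trajectory comparison is easy to sketch: record which tools the agent actually called and score them against the expected sequence, so a run that reaches the right outcome through the wrong steps is still visible. The helper below is a hypothetical illustration.

```python
from difflib import SequenceMatcher

def trajectory_match(expected_tools, actual_tools):
    """Compare recorded tool calls against an expected sequence.

    Reports an exact-match flag plus the fraction of expected steps that the
    actual trace reproduces in order, so partial divergence gets partial credit.
    """
    matcher = SequenceMatcher(a=expected_tools, b=actual_tools, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return {
        "exact_match": actual_tools == expected_tools,
        "in_order_recall": matched / len(expected_tools) if expected_tools else 1.0,
    }

# A run that issues the correct refund but skips identity verification:
print(trajectory_match(
    expected_tools=["lookup_account", "verify_identity", "issue_refund"],
    actual_tools=["lookup_account", "issue_refund"],
))  # exact_match: False, in_order_recall: ~0.67
```

An outcome-only check would score this run as a success; the trajectory comparison is what surfaces the skipped verification step.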
What You're Actually Choosing
When you select behavior evaluation with static datasets, you're measuring whether agents achieve correct outcomes in controlled conditions. Reliability evaluation across repeated runs or handling of unexpected inputs requires different evaluation choices.
When you evaluate capabilities through trajectory analysis, you're examining the agent's process—how it selects tools, plans steps, recovers from errors.
Different use cases need different combinations. Occasional data extraction might need only behavior evaluation. Continuous monitoring systems need reliability evaluation across thousands of runs. Transactional workflows need behavior, reliability, and safety evaluation together.
Each evaluation choice reveals certain aspects of agent performance while leaving others unmeasured. Understanding which aspects matter for your use case determines which evaluation approach serves you.
Things to follow up on...
- Benchmark saturation dynamics: As evaluation benchmarks approach saturation, large capability improvements appear as small score increases, making it harder to distinguish meaningful progress from incremental gains.
- Enterprise evaluation gaps: Production deployments require domain-specific metrics reflecting compliance requirements and reliability guarantees that standard research benchmarks typically overlook.
- LLM-as-judge reliability concerns: While widely used for scalable evaluation, LLM judges often overestimate success and miss important details when processing long, complex agent traces.
- Multi-agent evaluation complexity: When workflows split across multiple agents, evaluating individual agents in isolation produces misleading results that don't reflect system-level performance.

