Small language models like Luna-2 provide research-backed metrics for agent-specific challenges: tool selection quality, error detection, and action advancement. The models also distinguish tool execution failures, where the tool itself breaks, from agent usage errors, where the agent calls the tool incorrectly, so teams know whether to fix the tool or the agent.
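To make the difference between a tool execution failure and an agent usage error concrete, here is a minimal rule-based sketch. It is not Luna-2's method or API. The `ToolCall` record, the `TOOL_SCHEMAS` registry, and the `classify` function are hypothetical names invented for illustration, and a learned evaluator would replace the simple checks shown here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class Outcome(Enum):
    OK = "ok"
    AGENT_USAGE_ERROR = "agent_usage_error"            # the agent called the tool incorrectly
    TOOL_EXECUTION_FAILURE = "tool_execution_failure"  # the call was valid, but the tool failed


@dataclass
class ToolCall:
    """One logged tool invocation (hypothetical record shape)."""
    tool: str
    args: dict[str, Any]
    error: Optional[str] = None  # error reported by the tool at run time, if any


# Hypothetical registry: tool name -> names of its required arguments.
TOOL_SCHEMAS: dict[str, set[str]] = {
    "search_flights": {"origin", "destination", "date"},
    "get_weather": {"city"},
}


def classify(call: ToolCall) -> Outcome:
    """Attribute a failed call to the agent or to the tool."""
    required = TOOL_SCHEMAS.get(call.tool)
    # Unknown tool or missing required arguments: the agent misused the tool.
    if required is None or not required.issubset(call.args):
        return Outcome.AGENT_USAGE_ERROR
    # The call was well-formed but the tool reported an error: the tool failed.
    if call.error is not None:
        return Outcome.TOOL_EXECUTION_FAILURE
    return Outcome.OK


if __name__ == "__main__":
    calls = [
        ToolCall("get_weather", {"city": "Oslo"}),
        ToolCall("get_weather", {}),                                  # missing argument
        ToolCall("get_weather", {"city": "Oslo"}, error="HTTP 503"),  # backend outage
        ToolCall("book_hotel", {"city": "Oslo"}),                     # tool does not exist
    ]
    for call in calls:
        print(call.tool, call.args, "->", classify(call).value)
```

Under these assumptions, the two failure labels point at different owners: an agent usage error suggests revisiting prompts or tool descriptions, while a tool execution failure suggests fixing the tool or its backend.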