Agent Evaluation Evolves Beyond Ad-Hoc Testing