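Frameworks like AgentBench and AutoGen Evaluation address what traditional QA cannot: agents that produce different outputs from identical inputs. They provide purpose-built tools for conversation flows, tool verification, and decision trees. Testing probabilistic systems requires different infrastructure, because an exact-match assertion on a single run says little when the output can legitimately vary from one run to the next.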
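
To make the contrast with exact-match testing concrete, here is a minimal sketch of one common approach: run the same input many times and assert on a property of the outputs and on the pass rate, not on a fixed string. It does not use the AgentBench or AutoGen APIs. The names `pass_rate`, `make_stub_agent`, and `mentions_four` are hypothetical, and the stub agent is a seeded random stand-in for a real model call.

```python
import random
from typing import Callable


def pass_rate(
    agent: Callable[[str], str],
    prompt: str,
    check: Callable[[str], bool],
    trials: int = 100,
) -> float:
    """Call the agent `trials` times with the same prompt and return
    the fraction of outputs that satisfy `check`."""
    passed = sum(1 for _ in range(trials) if check(agent(prompt)))
    return passed / trials


def make_stub_agent(seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a real agent: same input, varying wording."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    phrasings = ["The answer is 4.", "2 + 2 = 4", "It equals 4", "Four (4)."]

    def agent(prompt: str) -> str:
        # A real agent would call a model here; the stub picks a phrasing.
        return rng.choice(phrasings)

    return agent


def mentions_four(output: str) -> bool:
    """Property check: the wording may vary, but the result must appear."""
    return "4" in output


agent = make_stub_agent(seed=0)
rate = pass_rate(agent, "What is 2 + 2?", mentions_four, trials=50)

# Assert on the aggregate, with a threshold chosen for the use case.
assert rate >= 0.95
```

The threshold of 0.95 is an illustrative assumption. In practice it would be set according to how much variation the application can tolerate, and the property check would be replaced by whatever the test needs to verify, such as a tool being called with valid arguments.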