Traditional CI/CD assumes determinism: identical inputs produce identical outputs. Agents break that assumption. The same query can yield different responses depending on sampling temperature, the contents of the context window, or a model update. Binary pass/fail tests catch outright failures but miss the variations in quality between responses that all technically pass.
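One way to picture the difference is a check that samples the agent several times and gates on a score distribution rather than on a single exact match. The following is a minimal sketch under stated assumptions, not a recommended implementation: `keyword_score`, `evaluate`, `make_canned_agent`, and the thresholds `min_mean` and `min_worst` are all hypothetical names and values chosen for illustration, and keyword coverage stands in for whatever quality metric a real pipeline would use.

```python
from itertools import cycle
from statistics import mean
from typing import Callable, Sequence


def keyword_score(response: str, required: Sequence[str]) -> float:
    """Fraction of required keywords found in the response (case-insensitive).

    A deliberately simple stand-in for a real quality metric.
    """
    if not required:
        return 1.0
    text = response.lower()
    hits = sum(1 for kw in required if kw.lower() in text)
    return hits / len(required)


def evaluate(
    agent: Callable[[str], str],
    query: str,
    required: Sequence[str],
    runs: int = 5,
    min_mean: float = 0.8,   # assumed threshold, tune per use case
    min_worst: float = 0.5,  # assumed threshold, tune per use case
) -> dict:
    """Call the agent `runs` times and gate on mean and worst-case score."""
    scores = [keyword_score(agent(query), required) for _ in range(runs)]
    avg, worst = mean(scores), min(scores)
    return {
        "scores": scores,
        "mean": avg,
        "worst": worst,
        "passed": avg >= min_mean and worst >= min_worst,
    }


def make_canned_agent(responses: Sequence[str]) -> Callable[[str], str]:
    """Hypothetical stub: cycles through fixed responses to mimic run-to-run variation."""
    pool = cycle(responses)
    return lambda query: next(pool)


if __name__ == "__main__":
    agent = make_canned_agent([
        "Refunds are available within 30 days.",
        "You can get a refund.",
        "Please contact support.",
    ])
    report = evaluate(agent, "What is the refund policy?", ["refund", "30 days"], runs=3)
    print(report)
```

In this sketch every stubbed response is a well-formed answer, so a test that only checks "did the agent return something" would pass all three. The score-based gate fails the run because the average and the worst-case coverage both fall below the assumed thresholds.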