Automated evaluation scales beautifully, but only if it correlates with human judgment. Teams optimize for LLM-as-judge scores without validating them against real user assessments, and then production performance disappoints. Human evaluation is slow and expensive, but it remains the gold standard: before trusting a judge, measure how well its scores track human ratings on a shared sample.
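
A minimal sketch of that validation step, assuming you have paired scores for the same outputs (the score arrays and the 0.8 threshold below are illustrative, not a standard):

```python
from scipy.stats import spearmanr

# Paired ratings for the same set of model outputs: one from the LLM judge,
# one from human raters. These values are hypothetical placeholder data.
judge_scores = [4.5, 3.0, 4.0, 2.5, 5.0, 3.5, 1.0, 4.0]
human_scores = [4.0, 2.0, 4.5, 3.0, 5.0, 2.5, 1.5, 3.5]

# Spearman's rank correlation tolerates the judge and humans using the
# scale differently; it asks only whether they rank outputs the same way.
rho, p_value = spearmanr(judge_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")

# A high rho suggests the judge is a usable proxy for human judgment;
# a low rho means optimizing the judge score may not improve real quality.
if rho < 0.8:
    print("Warning: weak agreement with human raters; revalidate the judge.")
```

A rank correlation is a deliberate choice here: LLM judges often compress or shift the rating scale relative to humans, so agreement on ordering matters more than agreement on raw values.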