Why LLM Judges Fail (and How to Fix Them)
LLM-as-judge is now the default eval method. Most implementations are unreliable. Here's why and what to do about it.
Teams ship AI features against eval scores that are themselves unreliable, which gives them false confidence about quality. Knowing the failure modes of LLM judges, and the fixes, separates production-grade AI work from theater.
LLM judges have known failure modes: length bias, self-preference, position bias, inconsistency on subjective criteria, and calibration drift. Mitigations: specific rubrics, a judge model different from the one being evaluated, multiple judges with the median taken, and periodic human calibration. Without these, the eval score is unreliable.
The five most common failures
1. Length bias. LLM judges systematically prefer longer answers, regardless of quality. Mitigation: explicitly score on conciseness, OR control for length in evaluation.
2. Self-preference. Models tend to rate their own output higher: Claude rates Claude output above GPT output, and GPT does the reverse. Mitigation: judge with a different model than the one being evaluated. Better: use multiple judges and take the median.
3. Position bias. When comparing two responses, judges over-weight the first one shown. Mitigation: randomize order; run each comparison both ways and average.
4. Inconsistency on subjective criteria. "Is this answer helpful?" produces different scores across runs. Mitigation: very specific rubrics with examples; multiple runs per eval.
5. Calibration drift. The same judge prompt can score differently when the underlying judge model is updated to a new version. Mitigation: keep a human-scored calibration set and re-run it periodically.
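The position-bias mitigation in item 3 (run each comparison both ways and average) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a definitive implementation: the `PairwiseJudge` signature, `debiased_preference`, and the stub judge are hypothetical names introduced here, and a real judge would call a model instead of returning a constant.

```python
from typing import Callable

# Assumed (hypothetical) judge signature: (question, first_shown, second_shown)
# -> preference for the FIRST-shown response, as a float in [0, 1].
PairwiseJudge = Callable[[str, str, str], float]


def debiased_preference(judge: PairwiseJudge, question: str, a: str, b: str) -> float:
    """Preference for `a` over `b`, averaged over both presentation orders."""
    a_shown_first = judge(question, a, b)  # preference for a, with a in slot 1
    b_shown_first = judge(question, b, a)  # preference for b, with b in slot 1
    # Convert the second run into a preference for a, then average the two.
    return (a_shown_first + (1.0 - b_shown_first)) / 2.0


def first_slot_biased_judge(question: str, first: str, second: str) -> float:
    """Stub judge: considers the answers equal (0.5) but adds 0.2 to slot 1."""
    return 0.5 + 0.2


single_run = first_slot_biased_judge("q", "answer A", "answer B")
both_orders = debiased_preference(first_slot_biased_judge, "q", "answer A", "answer B")
print(single_run, both_orders)
```

With this stub, a single run favors whichever answer is shown first, while the two-order average lands back at an even split, because the slot bonus cancels out.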
How to design a reliable LLM judge
Specific rubric, not vague criteria.
- Bad: "Is the answer good?"
- Good: "Score the answer 1-5. 5 = answers the question, cites source, no errors. 4 = answers correctly but doesn't cite. 3 = partial answer. 2 = misses the point. 1 = wrong or harmful."
Few-shot examples in the judge prompt. Provide 3-5 input-output-score examples showing what 5 looks like, what 3 looks like, what 1 looks like.
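One way to assemble a judge prompt from a specific rubric plus few-shot examples is sketched below. The rubric text is the "Good" example above; `ScoredExample`, `build_judge_prompt`, and the three placeholder examples are hypothetical and would be replaced with real, human-scored cases from your own product.

```python
from dataclasses import dataclass
from typing import Sequence

RUBRIC = """Score the answer 1-5.
5 = answers the question, cites source, no errors.
4 = answers correctly but doesn't cite.
3 = partial answer.
2 = misses the point.
1 = wrong or harmful."""


@dataclass
class ScoredExample:
    question: str
    answer: str
    score: int


def build_judge_prompt(rubric: str, examples: Sequence[ScoredExample],
                       question: str, answer: str) -> str:
    """Concatenate rubric, scored examples, and the item to be judged."""
    parts = [rubric, ""]
    for i, ex in enumerate(examples, start=1):
        parts += [f"Example {i}", f"Question: {ex.question}",
                  f"Answer: {ex.answer}", f"Score: {ex.score}", ""]
    parts += ["Now score this one.", f"Question: {question}",
              f"Answer: {answer}", "Score:"]
    return "\n".join(parts)


# Placeholder examples (assumed, not real data): one each for scores 5, 3, 1.
EXAMPLES = [
    ScoredExample("<question>", "<correct answer with a cited source>", 5),
    ScoredExample("<question>", "<answer covering only part of the question>", 3),
    ScoredExample("<question>", "<factually wrong answer>", 1),
]

prompt = build_judge_prompt(RUBRIC, EXAMPLES, "<new question>", "<answer under test>")
print(prompt)
```

Keeping the rubric and examples as data rather than inline strings makes it easier to revise them after a calibration round without touching the prompt-building logic.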
Force structured output. Judge returns JSON: { score: int, reasoning: string }. Reasoning makes errors debuggable.
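Structured output only helps if malformed replies are rejected rather than silently scored. A minimal validation sketch follows, assuming the judge replies with a JSON object holding `score` and `reasoning` keys and that scores run 1-5 as in the rubric above; `Verdict` and `parse_verdict` are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass
class Verdict:
    score: int
    reasoning: str


def parse_verdict(raw: str, lo: int = 1, hi: int = 5) -> Verdict:
    """Parse a judge reply and raise ValueError if it breaks the contract."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    score = data.get("score")
    reasoning = data.get("reasoning")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError("score must be an integer")
    if not lo <= score <= hi:
        raise ValueError(f"score {score} is outside {lo}-{hi}")
    if not isinstance(reasoning, str) or not reasoning.strip():
        raise ValueError("reasoning must be a non-empty string")
    return Verdict(score=score, reasoning=reasoning)


verdict = parse_verdict('{"score": 4, "reasoning": "Correct, but no source cited."}')
print(verdict)
```

A rejected reply can be retried or logged; either way it stays out of the aggregate score.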
Use a different judge model than the production model. Reduces self-preference. If production is Claude, judge with GPT, and vice versa.
Multiple judges + median for high-stakes evals. Run 3 judges (different models) and take the median. This costs 3x, but a single judge's outlier score no longer decides the result.
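The multi-judge setup reduces to taking a median over per-judge scores. The sketch below assumes each judge is a callable returning an integer score; `Judge`, `panel_score`, and the three stub judges are hypothetical stand-ins for calls to three different models. It also returns the spread between judges, which is an addition here, useful for flagging items to send to a human.

```python
from statistics import median
from typing import Callable, Sequence

# Assumed (hypothetical) judge signature: (question, answer) -> integer score.
Judge = Callable[[str, str], int]


def panel_score(judges: Sequence[Judge], question: str, answer: str) -> tuple[float, int]:
    """Return (median score, max-min spread) across all judges."""
    scores = [judge(question, answer) for judge in judges]
    return median(scores), max(scores) - min(scores)


# Stub judges standing in for three different models.
def strict_judge(question: str, answer: str) -> int:
    return 3


def lenient_judge(question: str, answer: str) -> int:
    return 5


def middle_judge(question: str, answer: str) -> int:
    return 4


score, spread = panel_score([strict_judge, lenient_judge, middle_judge], "q", "a")
print(score, spread)
```

Using an odd number of judges keeps the median equal to one of the actual scores rather than an average of two.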
The calibration practice
Every quarter:
- Take 50 representative evals
- Have 2-3 humans score each (3-5 ratings)
- Run the LLM judge on the same evals
- Compare. Where do they diverge?
- Refine the judge prompt to align with human judgment
This is laborious but it's how you know your judge is reliable.
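The "compare" step can be made concrete with a small report: collapse the human ratings for each eval to their median, then measure how far the judge lands from it. This is a sketch under stated assumptions; `calibration_report`, the dictionary layout, the one-point tolerance, and the three sample evals are all hypothetical, and the sample numbers are made up for illustration.

```python
from statistics import mean, median


def calibration_report(human_scores: dict[str, list[int]],
                       judge_scores: dict[str, int],
                       tolerance: int = 1) -> dict:
    """Compare judge scores with the median human score per eval."""
    rows = []
    for eval_id, ratings in human_scores.items():
        human = median(ratings)
        judge = judge_scores[eval_id]
        rows.append((eval_id, human, judge, judge - human))
    return {
        # Average absolute distance between judge and human median.
        "mae": mean(abs(gap) for *_, gap in rows),
        # Share of evals where the judge is within `tolerance` points.
        "within_tolerance": mean(1.0 if abs(gap) <= tolerance else 0.0 for *_, gap in rows),
        # Largest disagreements first: these are the ones to read by hand.
        "worst": sorted(rows, key=lambda row: abs(row[3]), reverse=True),
    }


# Made-up sample: three evals, three human ratings each.
humans = {"e1": [5, 4, 5], "e2": [2, 2, 3], "e3": [3, 3, 4]}
judge = {"e1": 5, "e2": 4, "e3": 3}

report = calibration_report(humans, judge)
print(report["worst"][0])
```

The items at the top of `worst` are where the judge prompt most needs refinement; reading the judge's reasoning on those usually shows which rubric line is being misapplied.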
When NOT to use LLM judges
- Highly subjective criteria (creative writing quality, humor): humans are still better.
- High-stakes binary decisions (is this safe? is this PII?): humans + rules.
- Brand-voice fit: humans who know the brand.
For most quantifiable AI quality questions in 2026, LLM judges are good enough with the right discipline. For nuanced ones, keep humans in the loop.
Real-world examples
By 2026, leading AI-native PM teams have moved to multi-judge eval systems: 3 different judge models scoring the same eval, with the median used as the truth. The cost is 3x; the reliability is dramatically higher. Cursor, Anthropic, and Linear all use variants of this.
Interview questions (1)
Q1. Your eval scores look great but users are complaining. What's wrong?
Almost certainly the eval suite or judge is misaligned with real user experience.
Diagnose in this order:
- Sample 20 real user complaints. Add them to the eval suite. Re-run. Do the new evals fail? If yes, the suite was missing important inputs.
- Check judge calibration. Take 30 evals where the judge said 'good.' Have a human re-score. If humans disagree often, the judge prompt is wrong (too vague, biased toward length, etc.).
- Check for surface gaps. What user behaviors does the eval suite NOT cover? Often the surface that fails wasn't represented.
- Check for production drift. The same model can behave differently in production because of context (chat history, retrieval results) that the eval doesn't simulate. Add production-like context to the eval setup.
The fix: refresh the eval suite quarterly based on production failures and complaints. Recalibrate the judge against human scores. Treat eval-suite quality as a first-class engineering investment.
The trap: trusting the eval score in absolute terms. The score is only meaningful if the suite + judge are aligned with reality.
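The judge-calibration check in the diagnostic mentions length bias as one possible cause. One rough way to test for it, sketched here under assumptions, is to correlate response length with the gap between judge and human scores: if the judge over-scores longer answers relative to humans, that correlation is clearly positive. `pearson`, `length_bias_signal`, and the sample data are hypothetical, and the numbers are made up for illustration.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def length_bias_signal(responses: Sequence[str],
                       judge_scores: Sequence[int],
                       human_scores: Sequence[int]) -> float:
    """Correlate word count with (judge score - human score)."""
    lengths = [len(r.split()) for r in responses]
    gaps = [j - h for j, h in zip(judge_scores, human_scores)]
    return pearson(lengths, gaps)


# Made-up sample: humans rate all four answers the same, the judge does not.
responses = ["word " * n for n in (2, 5, 10, 20)]
judge_scores = [2, 3, 4, 5]
human_scores = [3, 3, 3, 3]

signal = length_bias_signal(responses, judge_scores, human_scores)
print(round(signal, 2))
```

Correlating length with the raw judge score alone would be misleading, since longer answers are sometimes genuinely better; using the judge-minus-human gap controls for that.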