Verixa Lab
AI AGENTS · SUPPORT OPS · EVALUATION

Your LLM Judge Is Lying About Your Agent's Quality

You deployed a support agent. You're scoring it with GPT-4. The scores look great. Your customers disagree. Here's why, backed by 15+ papers from 2024–2025.

March 2026 · Verixa Lab

🚨 The False Confidence Problem

The metric your dashboard shows vs. what's actually happening

When you use an LLM to judge your support agent's responses, you're getting a number that feels precise but is systematically distorted by at least eight independent biases, each one invisible in your dashboard and each one compounding on the others. The research is clear: that 4.2/5 quality score your dashboard shows? It could be a 3.1 under honest evaluation.

Judge says: 84%
After position correction: 74%
After verbosity correction: 67%
After self-preference correction: 61%
Expert human review: 54%

Illustrative cascade based on documented bias magnitudes from Zheng et al. 2023, Shi et al. 2024, and Wataoka et al. 2024.

🪞 Bias #1 – The Judge Favors Itself

Your judge thinks its own family writes the best support answers

If you use GPT-4 to score your agent's GPT-4 outputs, you've built a system that literally reviews its own homework. Research confirms that models systematically rate their own outputs, and outputs from sibling models, higher than equally good alternatives. The mechanism is perplexity: a judge assigns lower perplexity to text that reads like its own writing, and Wataoka et al. show that those low-perplexity responses receive inflated scores.

SELF-PREFERENCE BIAS BY MODEL FAMILY
GPT-4o: 78%
Claude 3.5 Sonnet: 72%
Llama-3 70B: 65%
Cross-family judge: 48%

Higher = stronger preference for own/family outputs. Based on 5,000+ prompt-completion pairs.

📄 Wataoka et al. – Self-Preference Bias in LLM-as-a-Judge (ICLR 2025)
📄 Spiliopoulou et al. – Play Favorites (arXiv:2508.06709, Aug 2025)

What this means for your support agent: If your agent runs on GPT-4 and your judge is GPT-4, you have a structural conflict of interest baked into every evaluation. Quality scores are inflated by default, and you'll never know by how much without an external reference.
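
One structural mitigation: never let a model family grade its own outputs. Below is a minimal sketch of cross-family judge routing; the model names are illustrative and `call_judge` is a hypothetical wrapper around whatever LLM API you use.

```python
from typing import Callable

# Illustrative model-to-family mapping -- extend for your own stack.
MODEL_FAMILY = {
    "gpt-4o": "openai",
    "gpt-4o-mini": "openai",
    "claude-3-5-sonnet": "anthropic",
    "llama-3-70b": "meta",
}

AVAILABLE_JUDGES = ["gpt-4o", "claude-3-5-sonnet", "llama-3-70b"]


def pick_cross_family_judge(agent_model: str) -> str:
    """Return a judge model from a different family than the agent."""
    agent_family = MODEL_FAMILY[agent_model]
    for judge in AVAILABLE_JUDGES:
        if MODEL_FAMILY[judge] != agent_family:
            return judge
    raise ValueError("no cross-family judge available")


def evaluate(agent_model: str, prompt: str, response: str,
             call_judge: Callable[[str, str, str], float]) -> float:
    """Score a response with a judge outside the agent's model family."""
    return call_judge(pick_cross_family_judge(agent_model), prompt, response)


print(pick_cross_family_judge("gpt-4o"))  # -> claude-3-5-sonnet
```

Routing alone doesn't eliminate self-preference within the judge's own family, so rotating across two or more external families and averaging is safer still.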

๐Ÿ“ Bias #2 โ€” Order Decides the Winner

The order your agent's response appears in changes its score

In pairwise evaluations ("is response A or B better?"), LLM judges systematically favor one slot, most often the first. This isn't a subtle effect: simply swapping the order of two equal-quality responses can flip the verdict. For support agents, this means your A/B test results between prompt versions or model configs may reflect presentation order, not actual quality.

ACCURACY SHIFT WHEN SWAPPING RESPONSE ORDER
Code eval: 12%
General Q&A: 8%
Creative: 6%
Factual: 9%
Support-style: 10%

Shift in judge accuracy purely from changing which response is shown first.

📄 Shi et al. – A Systematic Study of Position Bias (AACL-IJCNLP 2025)
📄 Jiang et al. – Pairwise code judging position shifts (2025)
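
The standard mitigation, popularized by MT-Bench (Zheng et al. 2023), is to run every comparison twice with the order swapped and keep only verdicts that survive the swap. A minimal sketch, assuming a hypothetical `judge_pair(prompt, first, second)` call that returns "first" or "second":

```python
from typing import Callable


def debiased_compare(
    prompt: str,
    response_a: str,
    response_b: str,
    judge_pair: Callable[[str, str, str], str],
) -> str:
    """Judge in both orders; order-dependent verdicts become ties."""
    a_wins_shown_first = judge_pair(prompt, response_a, response_b) == "first"
    a_wins_shown_second = judge_pair(prompt, response_b, response_a) == "second"

    if a_wins_shown_first and a_wins_shown_second:
        return "A"
    if not a_wins_shown_first and not a_wins_shown_second:
        return "B"
    return "tie"  # verdict flipped with order: position bias, not quality


def always_first(prompt: str, first: str, second: str) -> str:
    return "first"  # toy judge with maximal position bias


print(debiased_compare("Where is my refund?", "resp A", "resp B", always_first))
# -> tie: the position-biased verdict does not survive the swap
```

The cost is 2x the judge calls; in exchange, the tie rate you observe is itself a running measurement of how position-biased your judge is.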

๐Ÿ“ Bias #3 โ€” Length Over Substance

Your agent learns that longer = higher score. Customers learn that longer = worse.

LLM judges reward verbosity. A support response that says "Your order has been refunded" scores lower than a five-paragraph essay that says the same thing with filler. Over time, your agent will optimize for what the judge rewards, producing bloated, over-explained responses that tank customer satisfaction even as judge scores climb.

THE VERBOSITY TRAP IN SUPPORT AGENTS
Judge score vs response length:
1–2 sentences: 62
3–5 sentences: 78
6–10 sentences: 88
10+ sentences: 91
Customer satisfaction vs length:
1–2 sentences: 72
3–5 sentences: 85
6–10 sentences: 58
10+ sentences: 34

The judge and your customer are optimizing for opposite things. The judge rewards thoroughness. The customer wants their problem solved in 15 seconds.

📄 Saito et al. 2023 – Verbosity bias in GPT-4 / GPT-3.5-Turbo
📄 Justice or Prejudice? CALM Framework (arXiv:2410.02736, 2024)
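
You can detect the verbosity trap in your own eval data with a single number: the correlation between response length and judge score. A minimal sketch using only the standard library (`statistics.correlation` requires Python 3.10+); the scores below are toy data:

```python
import statistics


def length_score_correlation(responses: list[str], scores: list[float]) -> float:
    """Pearson correlation between word count and judge score.
    A strongly positive value means the judge is rewarding length."""
    lengths = [len(r.split()) for r in responses]
    return statistics.correlation(lengths, scores)


responses = [
    "Your order has been refunded.",
    "Your order has been refunded. You will see the credit in 3-5 days.",
    "Thank you for reaching out! " * 5 + "Your order has been refunded.",
]
judge_scores = [3.8, 4.1, 4.6]  # toy scores that climb with length

print(f"r = {length_score_correlation(responses, judge_scores):.2f}")  # r = 1.00
```

Run the same correlation against CSAT: if length-vs-judge is positive while length-vs-CSAT is negative, you are watching the two curves above diverge in your own data.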

๐Ÿฅ Bias #4 โ€” Domain Expert Disagreement

The judge doesn't understand your domain. At all.

In support, accuracy isn't generic; it's domain-specific. A billing answer that's 95% right but gets the refund policy wrong is a catastrophic failure. LLM judges can't tell the difference. Across specialized fields, judges agree with human experts barely more than a coin flip.

LLM JUDGE vs. HUMAN EXPERT AGREEMENT BY DOMAIN
Code Translation: 81% (usable; structured, verifiable outputs)
Dietetics / Nutrition: 68% (marginal; misses nuanced clinical guidance)
Mental Health: 64% (dangerous gap for sensitive support topics)
Legal / Compliance: 54% (judge diverged from exam committee assessments)
Code Summarization: 42% (unstructured output → judges fail)
Multilingual Support: 30% (Fleiss' κ ≈ 0.3; essentially random across languages)
📄 ACM IUI 2025 – Limitations of LLM-as-a-Judge in Expert Knowledge Tasks
📄 arXiv:2511.04205 – LLM judges diverged from Polish legal exam committee assessments
📄 Fu et al. 2025 – Multilingual evaluation consistency
📄 Wang et al. 2025 – Code translation vs. summarization gap
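
Before trusting a judge in your domain, measure its agreement with experts on a labeled sample of your own tickets. A minimal sketch computing Cohen's κ between judge and expert pass/fail labels (the six labels are toy data):

```python
from collections import Counter


def cohens_kappa(judge: list[str], expert: list[str]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    n = len(judge)
    observed = sum(j == e for j, e in zip(judge, expert)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    jf, ef = Counter(judge), Counter(expert)
    expected = sum((jf[l] / n) * (ef[l] / n) for l in jf.keys() | ef.keys())
    return (observed - expected) / (1 - expected)


judge_labels = ["pass", "pass", "fail", "pass", "pass", "fail"]
expert_labels = ["pass", "fail", "fail", "pass", "fail", "pass"]

# 50% raw agreement sounds fine -- but kappa exposes it as pure chance.
print(f"kappa = {cohens_kappa(judge_labels, expert_labels):.2f}")  # kappa = 0.00
```

A κ below roughly 0.4 on your own domain means the judge's verdicts are closer to noise than to expert judgment, whatever the raw agreement percentage says.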

🧩 Bias #5 – Blind to Logic Errors

Fluent ≠ correct. Your judge can't tell the difference.

This is the most dangerous failure mode for support agents. LLM judges evaluate surface quality (fluency, structure, tone), not factual correctness. A support response that confidently gives the wrong refund policy, cites a nonexistent help article, or hallucinates a product feature will score well because it sounds right.

WHAT LLM JUDGES CATCH vs. WHAT THEY MISS
✓ Judges catch reliably
Tone / politeness
Grammar / fluency
Response structure
General relevance
Format compliance
✗ Judges miss consistently
Wrong policy cited
Hallucinated article links
Incorrect pricing/dates
Flawed refund logic
Fabricated product features

In clinical settings, LLMs repeat or elaborate on planted false information in up to 83% of cases. In legal evaluation, LLM judges accepted responses that cited nonexistent legal provisions. If your support agent hallucinates a return policy, the judge will rate it 4/5 for helpfulness.

📄 Nature Communications Medicine 2025 – 83% adversarial hallucination repeat rate
📄 Goodeye Labs 2025 – Judges miss logic errors experts catch easily
📄 W&B 2025 – Judges rely on heuristics, not verifiable evidence
📄 Trend Micro 2025 – Judges fail when external knowledge verification is needed
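
The practical fix is to take checkable facts out of the judge's hands entirely: verify them against ground truth before (or instead of) asking for a quality score. A minimal sketch; the policy table, article IDs, and regex patterns are illustrative assumptions:

```python
import re

# Ground truth the judge never sees -- illustrative values.
POLICY = {"refund_window_days": 30}
VALID_HELP_ARTICLES = {"HC-1042", "HC-2210"}


def grounding_violations(response: str) -> list[str]:
    """Check hard facts against ground truth. Only trust judge scores
    for responses that come back clean."""
    violations = []

    # Any stated refund window must match policy.
    for days in re.findall(r"(\d+)[- ]day refund", response):
        if int(days) != POLICY["refund_window_days"]:
            violations.append(f"wrong refund window: {days} days")

    # Any cited help-center article must actually exist.
    for article in re.findall(r"HC-\d+", response):
        if article not in VALID_HELP_ARTICLES:
            violations.append(f"hallucinated article: {article}")

    return violations


resp = "Per our 45-day refund policy (see HC-9999), you're all set!"
print(grounding_violations(resp))
# ['wrong refund window: 45 days', 'hallucinated article: HC-9999']
```

The judge can still score tone and structure; correctness of policies, prices, and links is decided by lookups it can't charm its way past.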

🎯 Bias #6 – The Judge Can Be Gamed

Your agent will learn to cheat the exam.

If you optimize your agent against an LLM judge (via RLHF, prompt tuning, or any feedback loop), the agent will eventually learn to exploit the judge's weaknesses rather than improve actual quality. Formatting tricks, confident tone, verbose padding, and keyword stuffing all reliably increase judge scores without improving the customer experience.

ATTACK SUCCESS RATES ON LLM JUDGES
JudgeDeceiver (optimized): 86%
Prompt injection: 62%
Verbose padding: 58%
Format manipulation: 51%
Keyword stuffing: 44%
Confident tone: 39%

Rate at which these techniques successfully inflated judge scores above honest assessment.

📄 Deepchecks 2025 – JudgeDeceiver optimization-based attacks
📄 Li et al. 2025 – Composite attacks on Alibaba PAI-Judge
📄 Raina et al. – Universal Adversarial Attacks on LLM Judges (EMNLP 2024)

The RLHF trap: If your agent fine-tunes against judge feedback, it's not learning to help customers; it's learning to write essays that impress GPT-4. Scores go up. CSAT goes down. You won't know until the tickets pile up.
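
You can probe your own judge's gameability before your agent finds it: take responses the judge has already scored, append content-free filler, and re-score. A minimal sketch, assuming a hypothetical `judge_score(prompt, response)` wrapper:

```python
from typing import Callable

# Content-free filler: adds zero information, only length and "empathy".
FILLER = (
    " I completely understand how frustrating this must be, and I want"
    " to assure you that we take every concern extremely seriously."
)


def padding_gain(
    prompt: str,
    response: str,
    judge_score: Callable[[str, str], float],
    copies: int = 3,
) -> float:
    """Score change from appending meaning-free filler. A reliably
    positive gain means verbosity alone can game this judge."""
    base = judge_score(prompt, response)
    padded = judge_score(prompt, response + FILLER * copies)
    return padded - base
```

Run it over a few hundred tickets: if the mean gain is positive, any feedback loop against this judge is actively training your agent to pad.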

๐ŸŽ›๏ธ Bias #7 โ€” Scoring Roulette

Change one word in your rubric. Watch the score swing 20%.

LLM judge scores are fragile. The same response, evaluated with the same model, can receive wildly different scores based on minor prompt variations: rubric order, numbering format, or how you phrase "quality." This makes week-over-week comparisons meaningless and A/B tests unreliable.

SCORE VARIANCE FROM PROMPT PERTURBATIONS
Rubric order shuffled: ±18%
Numeric → Roman IDs: ±14%
'Quality' → 'Helpfulness': ±11%
Reference answer changed: ±20%
Same prompt, different run: ±7%

Spread shows the correlation shift with human judgments from each perturbation type.

📄 Li et al. 2025 – Score sensitivity up to ±0.2 correlation shift
📄 Monte Carlo Data 2025 – 1 in 10 evaluations produce unreliable results
📄 Schroeder & Wood-Doughty 2024 – McDonald's omega reliability measure
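
The cheapest defense is to measure the fragility directly: score each response under several paraphrased rubrics and report the spread next to the mean. A minimal sketch; the rubric variants are illustrative and `judge_score(rubric, prompt, response)` is hypothetical:

```python
import statistics
from typing import Callable

# Semantically equivalent rubric phrasings -- illustrative variants.
RUBRICS = [
    "Rate the quality of this support response from 1 to 5.",
    "Rate the helpfulness of this support response from 1 to 5.",
    "On a scale of 1 to 5, how good is this support response?",
]


def robust_score(
    prompt: str,
    response: str,
    judge_score: Callable[[str, str, str], float],
) -> tuple[float, float]:
    """Mean and standard deviation of scores across rubric paraphrases."""
    scores = [judge_score(rubric, prompt, response) for rubric in RUBRICS]
    return statistics.mean(scores), statistics.stdev(scores)
```

If the spread rivals the week-over-week movement on your dashboard, the trend line you're reporting is noise.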

๐ŸŒ Bias #8 โ€” Non-Determinism ร— Multilingual Collapse

Same input. Different day. Different score. Different language? Forget about it.

LLM judges are non-deterministic: run the same evaluation three times and you can get three different scores. Multiply this by multilingual support queues and you get evaluation noise so high it drowns out real signal. For low-resource languages, judge consistency drops to Fleiss' κ ≈ 0.3, barely above random chance.

CROSS-LANGUAGE JUDGE CONSISTENCY (FLEISS' κ)
English: 0.78
Spanish: 0.55
German: 0.49
Japanese: 0.38
Hindi: 0.28
Swahili: 0.19

κ < 0.4 = poor agreement. κ < 0.2 = essentially random. Most non-English languages fall below usable thresholds.

📄 Fu et al. 2025 – Multilingual judge consistency, avg κ ≈ 0.3

For global support teams: If your agent serves customers in 10+ languages and you evaluate all of them with the same LLM judge, you have reliable evaluation in maybe 2 of those languages. The rest is noise you're treating as signal.
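
Before trusting any per-language number, measure run-to-run consistency the way the paper does: judge each item several times and compute Fleiss' κ over the repeated verdicts. A minimal self-contained sketch over pass/fail labels (the four tickets are toy data):

```python
from collections import Counter


def fleiss_kappa(ratings: list[list[str]]) -> float:
    """Fleiss' kappa; ratings[i] holds one item's verdicts across runs."""
    n = len(ratings[0])  # verdicts per item (same for every item)

    # Observed agreement: fraction of verdict pairs per item that agree.
    totals: Counter = Counter()
    p_bar = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        p_bar += sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
    p_bar /= len(ratings)

    # Agreement expected by chance from the overall verdict distribution.
    total = sum(totals.values())
    p_e = sum((c / total) ** 2 for c in totals.values())
    return (p_bar - p_e) / (1 - p_e)


# Three judge runs per ticket; flip-flopping verdicts drag kappa down.
runs = [
    ["pass", "pass", "pass"],
    ["pass", "fail", "pass"],
    ["fail", "pass", "fail"],
    ["pass", "fail", "fail"],
]
print(f"kappa = {fleiss_kappa(runs):.2f}")  # kappa = -0.03, essentially random
```

Compute this per language: anything below κ ≈ 0.4 shouldn't drive staffing, prompt, or model decisions for that queue.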

🔥 The Full Damage Map

How all 8 failure modes compound in a live support agent

Failure Mode | Severity | Detection Difficulty | Support Impact
Subtle Error Blindness | 🔴 Critical | Nearly impossible | Customers get wrong answers rated 4/5
Adversarial Gaming | 🔴 Critical | Very hard | Agent optimizes for judge, not customer
Domain Misalignment | 🔴 Critical | Hard | Policy/billing errors go undetected
Self-Preference Bias | 🟠 High | Hard | Inflated scores mask real quality
Verbosity Bias | 🟠 High | Medium | Agent produces bloated responses
Prompt Sensitivity | 🟠 High | Medium | Scores incomparable across eval runs
Position Bias | 🟡 Medium | Easy to detect | A/B tests produce false winners
Multilingual Collapse | 🔴 Critical | Hard | Non-English quality is unmonitored

📚 References

Papers, surveys, and industry reports cited in this article

[1] Wataoka, K. et al. (2024). Self-Preference Bias in LLM-as-a-Judge. ICLR 2025. https://arxiv.org/abs/2410.21819 [Self-Preference]
[2] Spiliopoulou, E. et al. (2025). Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge. https://arxiv.org/abs/2508.06709 [Self-Preference]
[3] Shi, W. et al. (2024). A Systematic Study of Position Bias in LLM-as-a-Judge. AACL-IJCNLP 2025. https://aclanthology.org/2025.ijcnlp-long.18.pdf [Position Bias]
[4] Jiang, D. et al. (2025). Pairwise Code Judging: Position Bias in Code Evaluation. [Position Bias]
[5] Saito, K. et al. (2023). Verbosity Bias in GPT-4 and GPT-3.5-Turbo. [Verbosity]
[6] Zheng, L. et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. [Survey]
[7] Chen, W. et al. (2024). Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge (CALM Framework). https://arxiv.org/abs/2410.02736 [Bias Framework]
[8] ACM IUI 2025. Limitations of the LLM-as-a-Judge Approach for Evaluating LLM Outputs in Expert Knowledge Tasks. https://dl.acm.org/doi/10.1145/3708359.3712091 [Domain Gap]
[9] arXiv:2511.04205 (2025). LLM-as-a-Judge is Bad: Polish National Board of Appeal Exam Study. https://arxiv.org/abs/2511.04205 [Domain Gap]
[10] Wang, J. et al. (2025). LLM Judges for Software Engineering: Code Translation vs. Summarization. [Domain Gap]
[11] Fu, J. et al. (2025). Multilingual NLG Evaluation Abilities of LLM-based Evaluators. [Multilingual]
[12] Omar, M. et al. (2025). Multi-model Assurance Analysis: LLMs Vulnerable to Adversarial Hallucination Attacks. Nature Communications Medicine. https://www.nature.com/articles/s43856-025-01021-3 [Hallucination]
[13] Raina, V. et al. (2024). Is LLM-as-a-Judge Robust? Universal Adversarial Attacks on Zero-shot LLM Assessment. EMNLP 2024. [Adversarial]
[14] Li, Z. et al. (2025). Composite Attacks on Commercial LLM-Judge Platforms (Alibaba PAI-Judge). [Adversarial]
[15] Li, H. et al. (2025). Score Sensitivity to Prompt Component Perturbations in LLM Judges. [Prompt Sensitivity]
[16] Schroeder, J. & Wood-Doughty, Z. (2024). McDonald's Omega as a Measure of LLM Evaluation Reliability. [Reliability]
[17] Gu, J. et al. (2024). A Survey on LLM-as-a-Judge. EMNLP 2025. https://aclanthology.org/2025.emnlp-main.138.pdf [Survey]

📰 Industry Reports

Blogs and industry analyses referenced

๐Ÿ“Ž
Goodeye Labs (Dec 2025). 2025 Year in Review for LLM Evaluation.https://www.goodeyelabs.com/insights/llm-evaluation-2025-review
๐Ÿ“Ž
Weights & Biases (Dec 2025). Exploring LLM-as-a-Judge.https://wandb.ai/site/articles/exploring-llm-as-a-judge/
๐Ÿ“Ž
Monte Carlo Data (Nov 2025). LLM-As-Judge: 7 Best Practices & Evaluation Templates.https://www.montecarlodata.com/blog-llm-as-judge/
๐Ÿ“Ž
Label Your Data (Dec 2025). LLM as a Judge: A 2026 Guide to Automated Model Assessment.https://labelyourdata.com/articles/llm-as-a-judge
๐Ÿ“Ž
Deepchecks (Sep 2025). What Is LLM As A Judge? Strategies, Impact & Best Practices.https://www.deepchecks.com/what-is-llm-as-a-judge-strategies-impact-and-best-practices/
๐Ÿ“Ž
Confident AI. LLM-as-a-Judge Simply Explained: The Complete Guide.https://www.confident-ai.com/blog/why-llm-as-a-judge-is-the-best-llm-evaluation-method
๐Ÿ“Ž
Cameron R. Wolfe, Ph.D. (Jul 2024). Using LLMs for Evaluation. Substack.https://cameronrwolfe.substack.com/p/llm-as-a-judge