The False Confidence Problem
The metric your dashboard shows vs. what's actually happening
When you use an LLM to judge your support agent's responses, you get a number that feels precise but is systematically distorted by at least eight independent biases, each one invisible in your dashboard and each one compounding the others. The research is clear: that 4.2/5 quality score on your dashboard could be a 3.1 under honest evaluation.
Illustrative cascade based on documented bias magnitudes from Zheng et al. (2023), Shi et al. (2024), and Wataoka et al. (2024).
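To make the cascade concrete, here is a minimal sketch of how several small, independent inflation factors multiply into a large gap between the dashboard number and an honest score. The factors below are invented purely for illustration, not taken from the cited papers; substitute magnitudes you have measured on your own traffic.

```python
# Purely illustrative inflation factors -- placeholders, not measured values.
ILLUSTRATIVE_INFLATION = {
    "self_preference": 1.08,     # judge favors its own model family
    "verbosity": 1.07,           # longer answers score higher
    "fluency_over_facts": 1.10,  # confident wrong answers still read well
    "prompt_sensitivity": 1.05,  # lucky rubric phrasing on this run
}

def dashboard_score(honest_score: float) -> float:
    """Apply each inflation factor in turn and clamp to the 5-point scale."""
    score = honest_score
    for factor in ILLUSTRATIVE_INFLATION.values():
        score *= factor
    return min(score, 5.0)

print(round(dashboard_score(3.1), 1))  # -> 4.1: an honest 3.1 reads as a ~4.1 on the dashboard
```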
Bias #1: The Judge Favors Itself
Your judge thinks its own family writes the best support answers
If you use GPT-4 to score your agent's GPT-4 outputs, you've built a system that literally reviews its own homework. Research confirms that models systematically rate their own outputs, and outputs from sibling models, higher than equally good alternatives. The mechanism is perplexity: text that reads the way the judge itself would write it looks more probable to the judge, and gets rewarded for that familiarity.
Higher = stronger preference for own/family outputs. Based on 5,000+ prompt-completion pairs.
- Spiliopoulou et al., "Play Favorites" (arXiv:2508.06709, Aug 2025)
What this means for your support agent: If your agent runs on GPT-4 and your judge is GPT-4, you have a structural conflict of interest baked into every evaluation. Quality scores are inflated by default, and you'll never know by how much without an external reference.
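One practical mitigation is cross-family judging: never let a judge score outputs from its own model family, and average over judges from other families instead. A minimal sketch, assuming `score_with(judge_model, response)` is whatever scoring call you already make; the function and model-family names are placeholders, not a specific vendor API.

```python
from statistics import mean
from typing import Callable, Sequence

def cross_family_score(
    response: str,
    judge_models: Sequence[str],
    agent_family: str,
    score_with: Callable[[str, str], float],
) -> float:
    """Average 1-5 scores only from judges outside the agent's model family.

    `score_with(judge_model, response)` is your existing LLM scoring call,
    injected so this sketch stays vendor-neutral.
    """
    eligible = [m for m in judge_models if not m.lower().startswith(agent_family.lower())]
    if not eligible:
        raise ValueError("All judges share the agent's family; add an external judge.")
    return mean(score_with(m, response) for m in eligible)
```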
Bias #2: Order Decides the Winner
The order your agent's response appears in changes its score
In pairwise evaluations ("is response A or B better?"), LLM judges consistently favor the one that appears first. This isn't a subtle effect. Simply swapping the order of two identical-quality responses can flip the verdict. For support agents, this means your A/B test results between prompt versions or model configs may reflect presentation order, not actual quality.
Shift in judge accuracy purely from changing which response is shown first.
- Jiang et al. (2025), position shifts in pairwise code judging
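A cheap guard is to run every pairwise comparison twice, once in each order, and only accept verdicts that survive the swap. A minimal sketch, assuming `pairwise_judge(first, second)` is your existing LLM call and that it returns "A" when it prefers the first argument:

```python
from typing import Callable, Optional

def order_robust_verdict(
    resp_a: str,
    resp_b: str,
    pairwise_judge: Callable[[str, str], str],
) -> Optional[str]:
    """Return "A" or "B" only if the verdict survives swapping the order."""
    verdict_1 = pairwise_judge(resp_a, resp_b)   # A shown first
    verdict_2 = pairwise_judge(resp_b, resp_a)   # B shown first
    # Map the swapped run back to the original labels before comparing.
    verdict_2_mapped = "A" if verdict_2 == "B" else "B"
    if verdict_1 == verdict_2_mapped:
        return verdict_1
    return None  # winner flipped with position: treat as a tie or re-judge
```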
Bias #3: Length Over Substance
Your agent learns that longer = higher score. Customers learn that longer = worse.
LLM judges reward verbosity. A support response that says "Your order has been refunded" scores lower than a five-paragraph essay that says the same thing with filler. Over time, your agent will optimize for what the judge rewards, producing bloated, over-explained responses that tank customer satisfaction even as judge scores climb.
The judge and your customer are optimizing for opposite things. The judge rewards thoroughness. The customer wants their problem solved in 15 seconds.
- "Justice or Prejudice?", CALM framework (arXiv:2410.02736, 2024)
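You can detect this drift without changing the judge at all: on a sample where humans see no quality difference, track how strongly judge scores track response length. A minimal sketch using only the standard library (`statistics.correlation` needs Python 3.10+); `judge_scores` are whatever your judge already produced:

```python
from statistics import correlation  # Python 3.10+

def length_score_correlation(responses: list[str], judge_scores: list[float]) -> float:
    """Pearson correlation between response length (in words) and judge score."""
    lengths = [float(len(r.split())) for r in responses]
    return correlation(lengths, judge_scores)

# A strongly positive value on a human-vetted "equal quality" sample suggests
# the judge is paying for padding, not resolution.
```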
Bias #4: Domain Expert Disagreement
The judge doesn't understand your domain. At all.
In support, accuracy isn't generic; it's domain-specific. A billing answer that's 95% right but gets the refund policy wrong is a catastrophic failure. LLM judges can't tell the difference. Across specialized fields, judges agree with human experts barely more often than a coin flip.
- arXiv:2511.04205, LLM judges diverged from Polish legal exam committee assessments
- Fu et al. (2025), multilingual evaluation consistency
- Wang et al. (2025), code translation vs. summarization gap
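Before trusting a judge on billing, legal, or policy questions, measure its agreement with your own experts on a labeled sample. A short sketch of Cohen's kappa, the standard chance-corrected agreement statistic; values near 0 mean the judge is agreeing with your experts at roughly chance level:

```python
from collections import Counter

def cohens_kappa(judge_labels: list[str], expert_labels: list[str]) -> float:
    """Chance-corrected agreement between judge labels and expert labels."""
    n = len(judge_labels)
    observed = sum(j == e for j, e in zip(judge_labels, expert_labels)) / n
    judge_freq = Counter(judge_labels)
    expert_freq = Counter(expert_labels)
    expected = sum(
        judge_freq[c] * expert_freq[c] for c in set(judge_freq) | set(expert_freq)
    ) / (n * n)
    if expected == 1.0:  # both raters used a single label for everything
        return 1.0
    return (observed - expected) / (1.0 - expected)
```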
Bias #5: Blind to Logic Errors
Fluent ≠ correct. Your judge can't tell the difference.
This is the most dangerous failure mode for support agents. LLM judges evaluate surface quality (fluency, structure, tone), not factual correctness. A support response that confidently gives the wrong refund policy, cites a nonexistent help article, or hallucinates a product feature will score well because it sounds right.
In clinical settings, LLMs repeat or elaborate on planted false information in up to 83% of cases. In legal evaluation, LLM judges accepted responses that cited nonexistent legal provisions. If your support agent hallucinates a return policy, the judge will rate it 4/5 for helpfulness.
- Nature Communications Medicine (2025), 83% adversarial hallucination repeat rate
- Goodeye Labs (2025), judges miss logic errors experts catch easily
- W&B (2025), judges rely on heuristics, not verifiable evidence
- Trend Micro (2025), judges fail when external knowledge verification is needed
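The defense is to never let fluency stand alone: pair the judge score with a grounding check against your actual policy or help-center content, and cap scores for unsupported answers. The sketch below uses naive keyword overlap purely as a placeholder; in practice you would plug in your retriever or an entailment check. All names and thresholds here are illustrative.

```python
def grounded_score(
    judge_score: float,
    response: str,
    policy_snippets: list[str],
    min_overlap: int = 5,
    cap: float = 2.0,
) -> float:
    """Cap the judge's score when no policy snippet supports the response.

    The overlap test is deliberately naive; replace it with retrieval or an
    entailment model against your knowledge base.
    """
    response_terms = set(response.lower().split())
    supported = any(
        len(set(snippet.lower().split()) & response_terms) >= min_overlap
        for snippet in policy_snippets
    )
    # Fluency alone should not earn a high score.
    return judge_score if supported else min(judge_score, cap)
```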
Bias #6: The Judge Can Be Gamed
Your agent will learn to cheat the exam.
If you optimize your agent against an LLM judge (via RLHF, prompt tuning, or any feedback loop), the agent will eventually learn to exploit the judge's weaknesses rather than improve actual quality. Formatting tricks, confident tone, verbose padding, and keyword stuffing all reliably increase judge scores without improving the customer experience.
Rate at which these techniques successfully inflated judge scores above honest assessment.
- Li et al. (2025), composite attacks on Alibaba PAI-Judge
- Raina et al., universal adversarial attacks on LLM judges (EMNLP 2024)
The RLHF trap: If your agent fine-tunes against judge feedback, it's not learning to help customers; it's learning to write essays that impress GPT-4. Scores go up. CSAT goes down. You won't know until the tickets pile up.
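You can estimate how exploitable your judge is before your agent finds out the hard way: pad a known-good response with confident, information-free filler and measure the score lift. A minimal sketch, with the filler text invented for illustration and `judge_fn` standing in for your single-response scoring call:

```python
from typing import Callable

# Hypothetical filler: adds confident tone and length but zero information.
FILLER = (
    " Rest assured, our dedicated team has fully and carefully reviewed every "
    "detail of your case, and we sincerely appreciate your continued patience."
)

def gaming_probe(response: str, judge_fn: Callable[[str], float], copies: int = 3) -> float:
    """Return the score lift the judge gives to a padded, content-free variant.

    A consistently positive lift means the judge can be gamed with verbosity
    and tone, so an agent optimized against it will learn exactly that.
    """
    padded = response + FILLER * copies
    return judge_fn(padded) - judge_fn(response)
```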
Bias #7: Scoring Roulette
Change one word in your rubric. Watch the score swing 20%.
LLM judge scores are fragile. The same response, evaluated with the same model, can receive wildly different scores based on minor prompt variations: rubric order, numbering format, or how you phrase "quality." This makes week-over-week comparisons meaningless and A/B tests unreliable.
Spread shows correlation shift with human judgments from each perturbation type.
- Monte Carlo Data (2025), 1 in 10 evaluations produce unreliable results
- Schroeder & Wood-Doughty (2024), McDonald's omega reliability measure
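A simple sanity check is to score the same response under several paraphrased rubrics and look at the spread. A minimal sketch, with hypothetical rubric variants and `judge_fn(rubric, response)` standing in for your existing scoring call:

```python
from statistics import mean, pstdev
from typing import Callable

# Hypothetical paraphrases of one rubric; use the wordings you actually run.
RUBRIC_VARIANTS = [
    "Rate the response quality from 1 to 5.",
    "On a 1-5 scale, how good is this support reply?",
    "Score 1-5: does this answer fully resolve the customer's issue?",
]

def rubric_sensitivity(response: str, judge_fn: Callable[[str, str], float]) -> dict:
    """Score one response under each rubric wording and report the spread.

    A large range means week-over-week score movements may be rubric noise,
    not agent improvement.
    """
    scores = [judge_fn(rubric, response) for rubric in RUBRIC_VARIANTS]
    return {"mean": mean(scores), "stdev": pstdev(scores), "range": max(scores) - min(scores)}
```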
Bias #8: Non-Determinism × Multilingual Collapse
Same input. Different day. Different score. Different language? Forget about it.
LLM judges are non-deterministic: run the same evaluation three times and you get three different scores. Multiply this by multilingual support queues and you get evaluation noise so high it drowns out real signal. For low-resource languages, judge consistency drops to a Fleiss' κ of roughly 0.3, barely above random chance.
κ < 0.4 = poor agreement. κ < 0.2 = essentially random. Most non-English languages fall below usable thresholds.
For global support teams: If your agent serves customers in 10+ languages and you evaluate all of them with the same LLM judge, you have reliable evaluation in maybe 2 of those languages. The rest is noise you're treating as signal.
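Before reading per-language dashboards, quantify the noise floor: re-score the same responses several times per language and measure the spread. A minimal sketch, with `judge_fn` standing in for your scoring call and the per-language grouping assumed to come from your own queue data:

```python
from statistics import mean, pstdev
from typing import Callable

def repeat_noise_by_language(
    responses_by_lang: dict[str, list[str]],
    judge_fn: Callable[[str], float],
    repeats: int = 3,
) -> dict[str, float]:
    """Mean per-item score spread when the same judge re-scores the same response.

    Languages with a large spread are producing noise, not signal; gate your
    dashboards on this number before treating their scores as trends.
    """
    report: dict[str, float] = {}
    for lang, responses in responses_by_lang.items():
        spreads = [pstdev([judge_fn(r) for _ in range(repeats)]) for r in responses]
        report[lang] = mean(spreads)
    return report
```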
The Full Damage Map
How all 8 failure modes compound in a live support agent
| Failure Mode | Severity | Detection Difficulty | Support Impact |
|---|---|---|---|
| Subtle Error Blindness | Critical | Nearly impossible | Customers get wrong answers rated 4/5 |
| Adversarial Gaming | Critical | Very hard | Agent optimizes for judge, not customer |
| Domain Misalignment | Critical | Hard | Policy/billing errors go undetected |
| Self-Preference Bias | High | Hard | Inflated scores mask real quality |
| Verbosity Bias | High | Medium | Agent produces bloated responses |
| Prompt Sensitivity | High | Medium | Scores incomparable across eval runs |
| Position Bias | Medium | Easy to detect | A/B tests produce false winners |
| Multilingual Collapse | Critical | Hard | Non-English quality is unmonitored |
References
Papers, surveys, and industry reports cited in this article
Industry Reports
Blogs and industry analyses referenced