Keywords: LLM-as-judge, Meta-evaluation, Evaluation theory, Anchored evaluation
TL;DR: LLMs used as judges converge to internally consistent but biased evaluations, a phenomenon we call meta-evaluation collapse; we show, both theoretically and empirically, that preventing it requires anchoring evaluations in human or formal ground-truth signals.
Abstract: Large language models (LLMs) are increasingly used as evaluators, yet their reliability as judges remains poorly understood. We introduce the concept of meta-evaluation collapse: recursive LLM-based evaluation converges toward internally consistent but fragile fixed points that are detached from human or domain-grounded truth. Through an operator-theoretic analysis, we show that unanchored evaluation hierarchies inevitably contract to biased equilibria, either collapsing into trivial consensus or amplifying systematic preferences such as fluency over accuracy. Empirically, using multilingual health queries, we find that LLM judges display high inter-model agreement yet drift sharply from human evaluators, compressing variance, inflating surface qualities, and overlooking cultural nuance. Comparative evaluations, often assumed to be more robust, reinforce these biases. Our analysis highlights the risks of over-relying on LLM consensus and calls for anchored meta-evaluation frameworks that integrate human disagreement, cultural diversity, and task-specific grounding.
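The following is a minimal, illustrative sketch (not the paper's code or analysis) of the dynamic the abstract describes: recursive judging modeled as a contraction toward a consensus- and fluency-biased fixed point, versus an anchored variant that mixes in a ground-truth signal each round. The `judge_round` operator, the fluency-bias vector, and all weights are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_items = 6
true_quality = rng.uniform(0.0, 1.0, size=n_items)  # hypothetical human/ground-truth scores
fluency = rng.uniform(0.6, 1.0, size=n_items)        # surface quality the judge over-rewards

def judge_round(scores, gamma=0.7, w_consensus=0.5):
    """One round of recursive judging: scores contract toward a blend of
    the current consensus (variance compression) and fluency (bias)."""
    target = w_consensus * scores.mean() + (1.0 - w_consensus) * fluency
    return gamma * scores + (1.0 - gamma) * target

def meta_evaluate(scores, anchor=None, anchor_weight=0.0, rounds=100):
    """Iterate the judge; optionally anchor each round to ground truth."""
    for _ in range(rounds):
        scores = judge_round(scores)
        if anchor is not None:
            scores = (1.0 - anchor_weight) * scores + anchor_weight * anchor
    return scores

unanchored = meta_evaluate(true_quality.copy())
anchored = meta_evaluate(true_quality.copy(), anchor=true_quality, anchor_weight=0.3)

print("ground truth       :", np.round(true_quality, 2))
print("unanchored fixed pt:", np.round(unanchored, 2))  # low variance, tracks fluency, not truth
print("anchored fixed pt  :", np.round(anchored, 2))    # pulled back toward ground truth
```

Under these assumptions, the unanchored iteration converges to a fixed point determined entirely by the consensus and fluency terms, independent of the true scores, while even a modest anchor weight keeps the equilibrium near the ground truth.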
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 25176