Keywords: LLM Evaluation, Non-Transitive Preferences, Condorcet Cycles, Evaluation Robustness, Evaluation Consistency, Model Scaling
TL;DR: Non-transitive preferences in LLM evaluation are not noise but structured signals that follow a negative binomial distribution, are modulated by linguistic features, and reveal a scale–consistency tradeoff.
Abstract: Large language models (LLMs) are increasingly deployed as judges of text quality, yet their verdicts often exhibit non-transitive preferences. We present a systematic study of Condorcet cycles as a diagnostic lens for LLM-as-a-judge. Across 688 debate motions and five frontier models (including the judge model itself, GPT-4o), we show that (i) cycle frequencies closely follow a negative binomial distribution ($R^2=0.9973$), (ii) linguistic properties such as syntactic complexity ($\beta=0.130$) and readability ($\beta=-0.085$) reliably modulate cycle formation (Poisson regression, pseudo-$R^2=0.106$, $p < 0.001$), and (iii) models display preliminary evidence of a strong scale--consistency tradeoff (Pearson $r=0.924$): larger models achieve higher average rankings but participate in more inconsistency cycles. These findings reframe cycles from ``noise to be removed'' into \emph{actionable diagnostics}, with practical metrics (stance-level: 7.19\%; motion-level: 14.10\%) and graph-theoretic tools that support reproducible, consistency-aware evaluation. This perspective opens a new dimension of scaling-law research, the scaling of consistency, and supports more robust evaluation paradigms for LLMs in high-stakes settings.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1035
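The Condorcet cycles discussed in the abstract can be illustrated with a minimal sketch. It assumes the judge's verdicts are available as a dictionary of pairwise preferences and looks only for cycles of length three (A beats B, B beats C, C beats A). The function name `find_three_cycles`, the `prefers` data layout, and the example verdicts are hypothetical and are not taken from the paper; the paper's stance-level and motion-level metrics are not reproduced here.

```python
from itertools import combinations

def find_three_cycles(prefers):
    """Return all length-3 preference cycles.

    `prefers[(a, b)]` is True when the judge prefers a over b
    (hypothetical layout; missing pairs count as "not preferred").
    """
    def beats(x, y):
        return prefers.get((x, y), False)

    items = sorted({x for pair in prefers for x in pair})
    cycles = []
    for a, b, c in combinations(items, 3):
        # A triple can be cyclic in exactly two orientations.
        if beats(a, b) and beats(b, c) and beats(c, a):
            cycles.append((a, b, c))  # a > b > c > a
        elif beats(a, c) and beats(c, b) and beats(b, a):
            cycles.append((a, c, b))  # a > c > b > a
    return cycles

# Illustrative verdicts: A, B, C form a cycle; D loses to everyone.
verdicts = {
    ("A", "B"): True, ("B", "C"): True, ("C", "A"): True,
    ("A", "D"): True, ("B", "D"): True, ("C", "D"): True,
}
cycles = find_three_cycles(verdicts)
```

With these example verdicts the only cyclic triple is A, B, C; a fully transitive set of verdicts yields an empty list.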