TL;DR: We investigate non-transitivity in LLM-based evaluation frameworks and show that LLM judges exhibit non-transitive preferences, which destabilize model rankings.
Abstract: Automatic evaluation methods based on large language models (LLMs) are emerging as the standard tool for assessing the instruction-following abilities of LLM-based agents. The most common method in this paradigm, pairwise comparison against a baseline model, critically depends on the assumption of transitive preferences. However, the validity of this assumption remains largely unexplored. In this study, we investigate the presence of non-transitivity within the AlpacaEval framework and analyze its effects on model rankings. We find that LLM judges exhibit non-transitive preferences, leading to rankings that are sensitive to the choice of the baseline model. To mitigate this issue, we show that round-robin tournaments combined with Bradley-Terry preference models can produce more reliable rankings. Notably, our method increases both the Spearman and Kendall correlations with Chatbot Arena (95.0\% $\rightarrow$ 96.4\% and 82.1\% $\rightarrow$ 86.3\%, respectively). To address the computational cost of round-robin tournaments, we propose Swiss-Wise Iterative Matchmaking (Swim) tournaments, which use a dynamic matching strategy to capture the benefits of round-robin tournaments while maintaining computational efficiency.
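To make the aggregation step concrete, below is a minimal sketch of fitting a Bradley-Terry model to a round-robin win-count matrix with the standard MM (minorization-maximization) update. The win counts, model count, and iteration settings here are illustrative assumptions, not the paper's implementation, and real judge outputs may involve ties or continuous preferences that this sketch ignores.

```python
import numpy as np

def bradley_terry(wins: np.ndarray, n_iter: int = 200, tol: float = 1e-8) -> np.ndarray:
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = number of times model i beat model j in the round-robin.
    Returns a strength vector; higher means stronger.
    """
    n_games = wins + wins.T            # comparisons per pair (diagonal is zero)
    total_wins = wins.sum(axis=1)      # total wins per model
    p = np.ones(wins.shape[0])
    for _ in range(n_iter):
        # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
        denom = n_games / (p[:, None] + p[None, :])
        p_new = total_wins / denom.sum(axis=1)
        p_new /= p_new.sum()           # normalize for identifiability
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p

# Toy round-robin among three hypothetical models.
wins = np.array([[0, 7, 9],
                 [3, 0, 6],
                 [1, 4, 0]], dtype=float)
strengths = bradley_terry(wins)
ranking = np.argsort(-strengths)       # indices sorted from strongest to weakest
print(strengths, ranking)
```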
Lay Summary: When evaluating AI chatbots that follow human instructions, researchers often rely on automatic comparisons made by powerful language models. This typically involves comparing two chatbots at a time, based on an implicit assumption: if Chatbot A is better than Chatbot B, and Chatbot B is better than Chatbot C, then Chatbot A should also be better than Chatbot C. However, we find that this assumption does not always hold, and such inconsistencies can significantly distort the overall rankings of AI chatbots. We examine this issue in a widely used evaluation framework called AlpacaEval and observe clear evidence of these ranking inconsistencies. To address the problem, we introduce an evaluation method inspired by round-robin tournaments, where each chatbot is compared against many others. The outcomes are then aggregated using a statistical model called Bradley-Terry to produce more consistent and accurate rankings. This approach significantly improves the reliability of AI chatbot evaluations. To reduce the high computational cost of round-robin comparisons, we also propose a more efficient matching strategy called Swim tournaments, which preserves the benefits of round-robin evaluation while requiring far fewer comparisons.
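The inconsistency described above corresponds to a cycle in the judge's pairwise preferences (A beats B, B beats C, yet C beats A). The sketch below counts such 3-cycles over a hypothetical preference table; the `beats` dictionary and model names are illustrative and do not reflect the paper's data format.

```python
from itertools import combinations, permutations

def count_nontransitive_triples(beats: dict, models: list) -> int:
    """Count 3-cycles (A > B, B > C, C > A) in a judge's pairwise preferences.

    beats[(a, b)] is True when the judge prefers model a over model b;
    exactly one of beats[(a, b)] and beats[(b, a)] is assumed to be True.
    """
    cycles = 0
    for trio in combinations(models, 3):
        # A triple is non-transitive iff its preference graph forms a cycle.
        for a, b, c in permutations(trio):
            if beats[(a, b)] and beats[(b, c)] and beats[(c, a)]:
                cycles += 1
                break  # count each unordered triple at most once
    return cycles

# Hypothetical judge preferences over three models.
models = ["A", "B", "C"]
beats = {("A", "B"): True, ("B", "A"): False,
         ("B", "C"): True, ("C", "B"): False,
         ("C", "A"): True, ("A", "C"): False}
print(count_nontransitive_triples(beats, models))  # -> 1
```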
Link To Code: https://github.com/yix8/llm-nontransitivity
Primary Area: Deep Learning->Large Language Models
Keywords: LLM-as-a-Judge, transitivity, pairwise comparison
Submission Number: 10869