Elo Uncovered: Robustness and Best Practices in Language Model Evaluation

Published: 25 Sept 2024 · Last Modified: 15 Jan 2025
Venue: NeurIPS 2024 (poster)
License: CC BY 4.0
Keywords: Elo Rating System, Language Model Evaluation, Reliability, Robustness, Reproducibility, LLM Ranking
TL;DR: This paper probes the Elo rating system in LLM evaluations, revealing its inherent volatility and providing empirical guidelines for ensuring robust and accurate model ranking in real-world scenarios.
Abstract: In Natural Language Processing (NLP), the Elo rating system, originally designed for ranking players in dynamic games such as chess, is increasingly used to evaluate Large Language Models (LLMs) through "A vs B" paired comparisons. However, while popular, the system's suitability for assessing entities with constant skill levels, such as LLMs, remains relatively unexplored. We study two fundamental axioms that evaluation methods should satisfy: reliability and transitivity. We conduct an extensive evaluation of Elo behavior across simulated and real-world scenarios, demonstrating that individual Elo computations can exhibit significant volatility. We show that neither axiom is always satisfied, raising questions about the reliability of current comparative evaluations of LLMs. If Elo scores are intended to substitute for costly head-to-head comparisons of LLMs, it is crucial that the resulting ranking be as robust as possible. Guided by the axioms, our findings offer concrete guidelines for enhancing the reliability of LLM evaluation methods, suggesting a need to reassess existing comparative approaches.
Primary Area: Evaluation (methodology, meta studies, replicability and validity)
Submission Number: 11462
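
To make the rating mechanics discussed in the abstract concrete, below is a minimal Python sketch of sequential Elo updates over pairwise "A vs B" outcomes. This is an illustration under stated assumptions, not the paper's implementation: the K-factor of 32, the 1000-point initialization, and the toy model names and battle outcomes are all hypothetical. Because the toy outcomes are intransitive (A beats B, B beats C, C beats A), reshuffling the same battles changes the final ranking, which is exactly the kind of order sensitivity and volatility the paper probes.

```python
import random

def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One sequential Elo update; score_a is 1 (A wins), 0.5 (tie), or 0 (B wins)."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

def rate(models, battles, k: float = 32.0):
    """Run sequential Elo over a list of (model_a, model_b, score_a) battles.

    K-factor and the 1000-point initial rating are illustrative choices,
    not values taken from the paper.
    """
    ratings = {m: 1000.0 for m in models}
    for a, b, s in battles:
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], s, k)
    return ratings

# Volatility check with hypothetical data: a cyclic (intransitive) set of
# outcomes, replayed in different random orders, can yield different rankings.
models = ["A", "B", "C"]
battles = [("A", "B", 1), ("B", "C", 1), ("C", "A", 1)] * 10
for trial in range(3):
    random.shuffle(battles)
    final = rate(models, battles)
    print(sorted(final.items(), key=lambda kv: -kv[1]))
```

Running this a few times shows the top-ranked model changing with the battle order alone, since sequential Elo weights recent games more heavily; this is one simple way to reproduce the reliability and transitivity failures the paper studies.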