The Fault in Our LLM Leaderboards

Published: 07 Jun 2025, Last Modified: 05 Aug 2025, Practical-DL 2025, CC BY 4.0
Keywords: Large Language Models, Leaderboard, Benchmark
Abstract: The rapid development of large language models (LLMs) has led to the creation of numerous benchmarks and leaderboards that assess model performance and ultimately guide model selection. A key assumption underlying benchmark-based model selection is that the measured performance transfers: similar tasks drawn from different source distributions should induce similar rankings over a given set of LLMs. This work critically examines this assumption by evaluating how well LLM rankings on common leaderboards transfer to unseen target tasks. To this end, we systematically analyze the correlation between benchmark-based rankings and actual performance rankings on diverse target tasks, highlighting discrepancies that challenge the reliability of using benchmark-based rankings for model selection. Our results reveal that benchmark-based rankings correlate at best moderately with real-world performance, with correlation values often falling below 0.5.
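The abstract's central measurement is a rank correlation between leaderboard rankings and target-task rankings of the same models. The sketch below illustrates one way such a comparison could be computed; the model names, scores, and the choice of Spearman/Kendall correlation are illustrative assumptions, not the paper's actual data or protocol.

```python
# A minimal sketch of comparing a leaderboard ranking against a target-task
# ranking via rank correlation. All scores below are hypothetical.
from scipy.stats import spearmanr, kendalltau

# Hypothetical benchmark (leaderboard) scores for a set of LLMs.
benchmark_scores = {"model_a": 82.1, "model_b": 79.4, "model_c": 75.0, "model_d": 68.3}

# Hypothetical scores for the same models on an unseen target task.
target_scores = {"model_a": 61.0, "model_b": 70.2, "model_c": 55.4, "model_d": 66.8}

# Align the two score lists on a common model order.
models = sorted(benchmark_scores)
x = [benchmark_scores[m] for m in models]
y = [target_scores[m] for m in models]

# Rank correlations: values near 1 mean the leaderboard ranking transfers to
# the target task; values below ~0.5 correspond to the weak transferability
# the abstract reports.
rho, _ = spearmanr(x, y)
tau, _ = kendalltau(x, y)
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
```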
Submission Number: 9