Keywords: evaluation, large language models, Elo ratings, metrics, benchmarks
TL;DR: We present a method for evaluating language models by comparing them in tournaments that are automatically constructed from benchmarks.
Abstract: For several decades, the standard approach to evaluating a learned model has been to compute a numerical loss that summarizes its quality on a previously unseen test set. Two models for the same task can then be compared by their scores on this set. However, recent experience with large language models (LLMs) has shown that comparing summary statistics of two broadly capable models may not reliably predict performance on real-world tasks. This has led to growing use of crowd-sourced human feedback that directly compares outputs from pairs of models. While helpful, this approach requires significant time and human effort, limiting the number of models that can be thoroughly evaluated. To address the need for a scalable method of comparing modern LLMs, we present a novel approach to evaluation via tournament-style model competitions that are constructed automatically from pre-existing benchmarks. We use these automatically constructed tournaments to compute ratings for a range of models on a diverse set of automatically scored tasks spanning both multiple-choice answers and free-form text generation. We compare four prominent rating systems: Elo, Glicko, TrueSkill™, and the Bradley-Terry model, and find that automatically constructed tournaments provide reliable information about the relative performance of LLMs while using only a fraction of the data required by current benchmark-based evaluation methods. We discuss implications for model evaluation and propose future directions for large-scale LLM comparisons.
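To make the core idea concrete, the sketch below shows one simple way pairwise "matches" could be derived from per-item benchmark correctness and fed into an Elo update. This is a minimal illustration only, not the paper's implementation: the match-construction rule (correct-vs-incorrect wins, otherwise a tie), the K-factor of 16, the initial rating of 1000, and the model names and results are all assumed for the example.

```python
# Illustrative sketch: benchmark items as pairwise "matches" scored with Elo.
# All model names, results, and constants below are hypothetical.
from itertools import combinations

K = 16  # assumed Elo K-factor for illustration

def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float) -> tuple[float, float]:
    """Update both ratings after one match; score_a is 1 (win), 0.5 (tie), 0 (loss)."""
    e_a = expected_score(r_a, r_b)
    return r_a + K * (score_a - e_a), r_b + K * ((1 - score_a) - (1 - e_a))

# Hypothetical per-item correctness on a benchmark: 1 = correct, 0 = incorrect.
results = {
    "model_a": [1, 1, 0, 1, 0],
    "model_b": [1, 0, 0, 1, 1],
    "model_c": [0, 0, 1, 1, 0],
}

ratings = {m: 1000.0 for m in results}

# Each benchmark item becomes one match between every pair of models:
# the model that answers correctly while the other does not wins; equal outcomes tie.
for a, b in combinations(results, 2):
    for x, y in zip(results[a], results[b]):
        score_a = 0.5 if x == y else float(x > y)
        ratings[a], ratings[b] = update(ratings[a], ratings[b], score_a)

print({m: round(r, 1) for m, r in ratings.items()})
```

The same pairwise outcomes could instead be aggregated with Glicko, TrueSkill, or a Bradley-Terry fit; Elo is shown here only because its update rule is the shortest to state.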
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5207