Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings

Published: 10 Jun 2025, Last Modified: 30 Jun 2025 · MoFA Oral · CC BY 4.0
Keywords: Preference-based Evaluations, Robustness to Data Dropping, Bradley--Terry Model, Influence Functions
TL;DR: We present a method for auditing the robustness of LLM ranking systems to worst-case data-dropping; we find that dropping just 0.02% of human (and AI) preferences can change the top-ranked models on Chatbot Arena.
Abstract: We propose a method for evaluating the robustness of a widely used LLM ranking system---the Bradley--Terry ranking system---to dropping a small, worst-case fraction of evaluation data. Our approach is computationally fast and easy to adopt. When we apply our method to matchups from two popular human-preference platforms, Chatbot Arena and MT-Bench, we find that the Bradley--Terry rankings of top-performing models are remarkably sensitive to the removal of a small fraction of evaluations. Our framework also identifies the specific evaluations most responsible for such ranking flips, enabling inspection of these influential preferences. We observe that the rankings derived from MT-Bench preferences are notably more robust than those from Chatbot Arena, likely due to MT-Bench's use of expert annotators and carefully constructed prompts. Finally, we find that rankings based on crowdsourced human evaluations are just as sensitive as those based on LLM-as-a-judge evaluations: in both cases, dropping as little as 0.02% of the total evaluations in the dataset can change the top-ranked model.
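To make the setting concrete, here is a minimal, hypothetical sketch of the audited pipeline: fit a Bradley--Terry model to pairwise preferences, then brute-force check which individual comparisons, if dropped, would change the top-ranked model. All function names and data are illustrative; the paper's contribution is an efficient approximation of this search (the naive refit-everything loop below does not scale to Chatbot Arena's millions of votes).

```python
# Hypothetical sketch (not the authors' code): Bradley-Terry fitting plus a
# brute-force single-drop audit of the top-ranked model.
import numpy as np

def to_wins(comparisons, n_models):
    """Turn a list of (winner, loser) pairs into a win-count matrix."""
    wins = np.zeros((n_models, n_models))
    for w, l in comparisons:
        wins[w, l] += 1
    return wins

def fit_bradley_terry(wins, n_iter=200):
    """MM iteration for Bradley-Terry strengths p_i, where
    P(model i beats model j) = p_i / (p_i + p_j)."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(n_iter):
        for i in range(n):
            total_wins = wins[i].sum()
            denom = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            p[i] = total_wins / denom
        p /= p.sum()  # normalize: strengths are identifiable only up to scale
    return p

def single_drop_flips(comparisons, n_models):
    """Brute-force audit: return indices of comparisons whose removal alone
    changes the top-ranked model. Purely illustrative of the question the
    paper asks; its method avoids refitting for every candidate drop."""
    base_top = int(np.argmax(fit_bradley_terry(to_wins(comparisons, n_models))))
    flips = []
    for k in range(len(comparisons)):
        sub = comparisons[:k] + comparisons[k + 1:]
        top = int(np.argmax(fit_bradley_terry(to_wins(sub, n_models))))
        if top != base_top:
            flips.append(k)
    return flips
```

A dataset is robust at the top if `single_drop_flips` (and its multi-drop generalizations) comes back empty even for a worst-case small set of removals; the paper's finding is that on real preference data, removing as little as 0.02% of evaluations suffices to flip the leader.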
Submission Number: 42