Research Area: Evaluation
Keywords: evaluation,llm,bias
TL;DR: A simple regression-based method to mitigate the length bias of AlpacaEval.
Abstract: LLM-based auto-annotators have become a key component of the LLM development process due to their cost-effectiveness and scalability compared to human-based evaluation.
However, these auto-annotators can introduce complex biases that are hard to remove. Even simple, known confounders, such as a preference for longer outputs, remain in existing automated evaluation metrics.
We propose a simple regression analysis approach for controlling biases in auto-evaluations.
As a real case study, we focus on reducing the length bias of AlpacaEval, a fast and affordable benchmark for instruction-following LLMs that uses LLMs to estimate response quality.
Despite being highly correlated with human preferences, AlpacaEval is known to favor models that generate longer outputs.
We introduce a length-controlled AlpacaEval that aims to answer the counterfactual question: "What would the preference be if the model's and baseline's output had the same length?"
To achieve this, we first fit a GLM to predict the biased output of interest (auto-annotator preferences) based on the mediators we want to control for (length difference) and other relevant features.
We then obtain length-controlled preferences by predicting preferences while conditioning the GLM with a zero difference in lengths.
Length-controlling not only improves the robustness of the metric to manipulations in model verbosity, but also increases the Spearman correlation with LMSYS' Chatbot Arena from 0.94 to 0.98.
We release our code and leaderboard.
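To make the counterfactual step concrete, below is a minimal sketch of the regression-based length control described in the abstract, not the official AlpacaEval-LC implementation. It assumes a pandas DataFrame with a binary auto-annotator preference, a length-difference feature, and a hypothetical extra covariate (`instruction_hardness`) standing in for the other relevant features mentioned above.

```python
# Minimal sketch of length-controlled preferences via a logistic GLM.
# Assumed DataFrame columns (illustrative names, not from the paper):
#   preference           - auto-annotator preference for the model over the baseline (0/1)
#   len_diff             - model output length minus baseline output length
#   instruction_hardness - hypothetical covariate for other relevant features
import pandas as pd
import statsmodels.api as sm

def length_controlled_winrate(df: pd.DataFrame) -> float:
    # Fit a GLM predicting the biased preference from the mediator
    # we want to control for (length difference) plus other covariates.
    X = sm.add_constant(df[["len_diff", "instruction_hardness"]])
    glm = sm.GLM(df["preference"], X, family=sm.families.Binomial()).fit()

    # Counterfactual prediction: what would the preference be if the
    # model's and baseline's outputs had the same length (len_diff = 0)?
    X_cf = X.copy()
    X_cf["len_diff"] = 0.0
    lc_preferences = glm.predict(X_cf)

    # Average the counterfactual preferences into a length-controlled win rate.
    return float(lc_preferences.mean())
```

Conditioning the fitted GLM on a zero length difference, rather than simply discarding length as a feature, keeps the other covariates' contributions intact while removing the verbosity mediator from the predicted preference.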
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 707