Unbiased Evaluation of Large Language Models from a Causal Perspective

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: This paper is the first to identify the biases in Agents-as-an-Evaluator and propose the Unbiased Evaluator, an evaluation protocol that delivers a more comprehensive, unbiased, and interpretable assessment of LLMs.
Abstract: Benchmark contamination has become a significant concern in the LLM evaluation community. Previous Agents-as-an-Evaluator methods address this issue by involving agents in the generation of questions. Despite their success, the biases in Agents-as-an-Evaluator methods remain largely unexplored. In this paper, we present a theoretical formulation of evaluation bias, providing valuable insights into designing unbiased evaluation protocols. Furthermore, we identify two types of bias in Agents-as-an-Evaluator through carefully designed probing tasks on a minimal Agents-as-an-Evaluator setup. To address these issues, we propose the Unbiased Evaluator, an evaluation protocol that delivers a more comprehensive, unbiased, and interpretable assessment of LLMs. Extensive experiments reveal significant room for improvement in current LLMs. Additionally, we demonstrate that the Unbiased Evaluator not only offers strong evidence of benchmark contamination but also provides interpretable evaluation results.
Lay Summary: As AI language models like ChatGPT get more powerful, it's important to test them in a fair and accurate way. But many of the tests used today may be "contaminated"—meaning the AI might have already seen some of the test questions during its training. This gives it an unfair head start and can make it seem smarter than it really is. In our research, we explain what these hidden biases are and why they matter. We then design simple experiments to reveal two common types of bias in current evaluation methods. To solve this problem, we introduce a new system called the Unbiased Evaluator. It checks whether a model truly understands a piece of knowledge by asking the same question in different ways. Just like a child is considered to know something only if they can answer all related questions correctly, our method ensures that the AI really "gets it" and isn't just guessing or remembering. Our results show that even today’s top AI models still have a lot of room to grow. We also demonstrate that our approach is better at detecting hidden advantages from contamination and at revealing what these models truly understand.
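The "answer all related questions correctly" criterion from the lay summary can be sketched in a few lines of Python. This is only an illustration of that aggregation rule, not the paper's actual protocol or code; the `ask_model`, `is_correct`, and `question_groups` names are hypothetical placeholders supplied by the reader.

```python
from typing import Callable, Iterable, List

def knows_fact(
    ask_model: Callable[[str], str],          # hypothetical: sends a prompt to the LLM, returns its answer
    variants: Iterable[str],                  # different rephrasings of the same underlying question
    is_correct: Callable[[str, str], bool],   # hypothetical: judges an answer for a given question
) -> bool:
    """Credit a knowledge point only if every rephrasing is answered correctly."""
    return all(is_correct(q, ask_model(q)) for q in variants)

def consistency_score(
    question_groups: List[List[str]],         # each group = one knowledge point and its rephrasings
    ask_model: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
) -> float:
    """Fraction of knowledge points the model handles consistently across all variants."""
    mastered = sum(knows_fact(ask_model, group, is_correct) for group in question_groups)
    return mastered / len(question_groups)
```

Under such a rule, a model that has merely memorized one surface form of a question receives no credit unless it also handles the rephrasings, which is the intuition the lay summary gives for distinguishing genuine understanding from contamination.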
Primary Area: Deep Learning->Large Language Models
Keywords: Agents-as-an-Evaluator, LLM Evaluation, Evaluation Bias
Submission Number: 10996