Eliminating Discriminative Shortcuts in Multiple Choice Evaluations with Answer Matching

Published: 10 Jun 2025, Last Modified: 14 Jul 2025 · ICML 2025 World Models Workshop · CC BY 4.0
Keywords: QA, benchmarks, evaluations, llm-as-a-judge, alignment, grading
TL;DR: We show that using language models to match free-form responses against a reference answer aligns better with human grading than multiple choice or LLM-as-a-judge.
Abstract: Multiple choice benchmarks have long been the workhorse of language model evaluation because grading multiple choice is objective and easy to automate. However, we show that popular multiple-choice benchmarks admit superficial shortcuts that yield high accuracy without even looking at the questions, reflecting a fundamental limitation of discriminative evaluation not shared by evaluations of the model's free-form, generative answers. To circumvent this issue, we consider a scalable method for generative evaluation, which we call answer matching: give the candidate model the question without the options, have it generate a free-form response, then use a modern language model with the reference answer to determine whether the response matches the reference. Comparing multiple choice, "LLM-as-judge" without references, and answer-matching evaluations against human grading, we find that multiple choice aligns poorly with humans, while answer matching using recent models, even small ones, achieves near-perfect alignment, within inter-grader agreement. In light of this, we discuss how to move the evaluation format from multiple choice to answer matching.
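For intuition, the following is a minimal Python sketch of the answer-matching protocol described in the abstract. The candidate and grader callables and the prompt wording are illustrative placeholders of our own, not the paper's implementation; any chat-model client can be plugged in.

from typing import Callable

def answer_matching(
    question: str,
    reference_answer: str,
    candidate: Callable[[str], str],  # candidate model: prompt -> free-form answer
    grader: Callable[[str], str],     # grader model: prompt -> verdict text
) -> bool:
    """Grade one question via answer matching (illustrative sketch)."""
    # 1. Ask the candidate the question alone, with no answer options shown.
    free_form = candidate(f"Question: {question}\nAnswer concisely.")

    # 2. Ask the grader model whether the free-form response matches the reference.
    grading_prompt = (
        "You are grading an exam answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Student answer: {free_form}\n"
        "Does the student answer match the reference? Reply 'yes' or 'no'."
    )
    return grader(grading_prompt).strip().lower().startswith("yes")

Benchmark accuracy under this protocol is simply the fraction of questions for which answer_matching returns True, with no answer options ever shown to the candidate model.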
Submission Number: 23