Choices Speak Louder than Questions

ICLR 2026 Conference Submission 17078 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: large language model, evaluation methodologies, multiple choice question
TL;DR: The paper shows that models are often biased by answer choices, and proposes Normalized Probability Shift by the Question (NPSQ), a new metric that isolates the influence of the question and is less affected by the composition of the answer choices.
Abstract: Recent findings raise concerns about whether Multiple-Choice Question Answering (MCQA) evaluation accurately reflects the comprehension abilities of large language models. This paper explores *choice sensitivity*: the tendency for model decisions to be driven more by the answer options than by a genuine understanding of the question. We introduce a new scoring method, **Normalized Probability Shift by the Question (NPSQ)**, designed to isolate the impact of the question itself and provide a more reliable assessment of comprehension. Through experiments with various input formats, including cloze, symbol, and hybrid formats, we find that traditional scoring methods, such as those based on log-likelihood or its length-normalized variant, are vulnerable to superficial characteristics of the answer choices. In contrast, NPSQ remains stable even when the answer options are modified.
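The abstract's core idea, measuring how much the question itself shifts a model's probability mass over the answer choices, relative to the choices-only baseline, can be illustrated with a small sketch. Note that the paper's exact NPSQ formula is not given on this page; the function below (`npsq_like_score`, a hypothetical name) is only one plausible formalization of a "normalized probability shift by the question", using toy probabilities in place of real model outputs:

```python
def npsq_like_score(p_with_question, p_choices_only):
    """Sketch of a question-isolating score (NOT necessarily the paper's NPSQ).

    p_with_question:  dict option -> model probability given question + choices
    p_choices_only:   dict option -> model probability given choices alone
                      (the "prior" induced purely by the answer options)

    Returns a dict of normalized shifts: how much the question moved
    probability toward (+) or away from (-) each option, scaled so the
    absolute shifts sum to 1 and choice-only preferences cancel out.
    """
    shifts = {o: p_with_question[o] - p_choices_only[o] for o in p_with_question}
    total = sum(abs(s) for s in shifts.values())
    if total == 0.0:
        # The question contributed nothing beyond the choices themselves.
        return {o: 0.0 for o in shifts}
    return {o: s / total for o, s in shifts.items()}


# Toy example: the choices alone make the model ambivalent between A and B,
# but adding the question pushes mass toward A.
score = npsq_like_score(
    p_with_question={"A": 0.7, "B": 0.2, "C": 0.1},
    p_choices_only={"A": 0.4, "B": 0.4, "C": 0.2},
)
```

Under this toy formalization, an option whose probability is high only because of surface features of the choices (the choice-sensitivity failure mode the paper describes) receives little credit, since the choices-only baseline is subtracted out before normalization.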
Supplementary Material: zip
Primary Area: generative models
Submission Number: 17078