Keywords: LLM Bias, LLM Evaluation, SATA, Debiasing, Reasoning
TL;DR: A Select-All-That-Apply (SATA) multiple-choice benchmark for large language models
Abstract: Current large language model (LLM) evaluations primarily focus on single-answer tasks, whereas many real-world applications require identifying multiple correct answers. This capability remains underexplored due to the lack of dedicated evaluation frameworks. We introduce \method, a benchmark for evaluating LLMs on Select All That Apply (SATA) questions spanning six domains, including reading comprehension, legal reasoning, and biomedicine. Our evaluation of 32 models reveals substantial limitations: the strongest model achieves only 75.3% Jaccard Index and 41.8% exact match accuracy. We identify three systematic biases underlying these failures: (i) unselection bias, where models systematically avoid selecting certain correct answer choices; (ii) speculation bias, where models include incorrect answers when uncertain; and (iii) count bias, where models consistently underpredict the number of correct answers. To address these limitations, we propose Choice Funnel, a decoding strategy that combines token debiasing with adaptive thresholding and abstention handling to guide models toward complete and accurate multi-answer selections. Choice Funnel improves exact match accuracy by up to 29% while reducing inference cost by more than 64% compared to existing approaches. We release \method and Choice Funnel to encourage the development of LLMs capable of robust decision-making in realistic multi-answer scenarios.
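To make the Choice Funnel idea concrete, below is a minimal illustrative sketch of one way a debiased, threshold-and-abstention selection loop could work. This is an assumption for illustration only, not the paper's actual algorithm: the function name `choice_funnel`, the `option_probs`/`abstain_prob` inputs, the fixed threshold, and the renormalization scheme are all hypothetical placeholders.

```python
# Hypothetical sketch of a Choice-Funnel-style multi-answer selection loop.
# All names, inputs, and the thresholding/renormalization details are
# assumptions made for illustration; the paper's method may differ.
from typing import Dict, List


def choice_funnel(option_probs: Dict[str, float],
                  abstain_prob: float,
                  threshold: float = 0.1) -> List[str]:
    """Iteratively select answer options until abstention wins or no
    remaining option clears the threshold.

    option_probs: per-option selection probabilities after token debiasing
                  (e.g., renormalized over the option letters only).
    abstain_prob: probability mass assigned to a "no further answers" signal.
    threshold:    minimum probability for an option to be selected.
    """
    selected: List[str] = []
    remaining = dict(option_probs)

    while remaining:
        best_option, best_prob = max(remaining.items(), key=lambda kv: kv[1])
        # Stop when the model prefers abstaining over every remaining option,
        # or when the best remaining option falls below the threshold.
        if abstain_prob >= best_prob or best_prob < threshold:
            break
        selected.append(best_option)
        del remaining[best_option]
        # Renormalize the remaining mass so the funnel narrows each round.
        total = sum(remaining.values()) + abstain_prob
        if total > 0:
            remaining = {k: v / total for k, v in remaining.items()}
            abstain_prob /= total

    return selected or ["(abstain)"]


if __name__ == "__main__":
    probs = {"A": 0.40, "B": 0.30, "C": 0.05, "D": 0.02}
    # Selects A and B, then stops once abstention dominates option C.
    print(choice_funnel(probs, abstain_prob=0.23))
```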
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 14547