Keywords: LLM Bias, LLM Evaluation, SATA, Debiasing, Reasoning
TL;DR: A Select-All-That-Apply (SATA) multiple-choice benchmark for large language models
Abstract: Current large language model (LLM) evaluations primarily focus on single-answer tasks, whereas many real-world applications require identifying multiple correct answers. This capability remains underexplored due to the lack of dedicated evaluation frameworks. We introduce \method, a benchmark for evaluating LLMs on Select All That Apply (SATA) questions spanning six domains, including reading comprehension, legal reasoning, and biomedicine. Our evaluation of 32 models reveals substantial limitations: the strongest model achieves only 75.3% Jaccard Index and 41.8% exact match accuracy. We identify three systematic biases underlying these failures: (i) unselection bias, where models systematically avoid selecting certain correct answer choices; (ii) speculation bias, where models include incorrect answers when uncertain; and (iii) count bias, where models consistently underpredict the number of correct answers. To address these limitations, we propose Choice Funnel, a decoding strategy that combines token debiasing with adaptive thresholding and abstention handling to guide models toward complete and accurate multi-answer selections. Choice Funnel improves exact match accuracy by up to 29% while reducing inference cost by more than 64% compared to existing approaches. We release \method and Choice Funnel to encourage the development of LLMs capable of robust decision-making in realistic multi-answer scenarios.
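To make the Choice Funnel idea concrete, below is a minimal illustrative sketch of one way a debiased, threshold-and-abstention selection loop could work. This is an assumption for illustration only, not the paper's actual algorithm: the function name `choice_funnel`, the `option_probs`/`abstain_prob` inputs, the fixed threshold, and the renormalization scheme are all hypothetical placeholders.

```python
# Hypothetical sketch of a Choice-Funnel-style multi-answer selection loop.
# All names, inputs, and the thresholding/renormalization details are
# assumptions made for illustration; the paper's method may differ.
from typing import Dict, List


def choice_funnel(option_probs: Dict[str, float],
                  abstain_prob: float,
                  threshold: float = 0.1) -> List[str]:
    """Iteratively select answer options until abstention wins or no
    remaining option clears the threshold.

    option_probs: per-option selection probabilities after token debiasing
                  (e.g., renormalized over the option letters only).
    abstain_prob: probability mass assigned to a "no further answers" signal.
    threshold:    minimum probability for an option to be selected.
    """
    selected: List[str] = []
    remaining = dict(option_probs)

    while remaining:
        best_option, best_prob = max(remaining.items(), key=lambda kv: kv[1])
        # Stop when the model prefers abstaining over every remaining option,
        # or when the best remaining option falls below the threshold.
        if abstain_prob >= best_prob or best_prob < threshold:
            break
        selected.append(best_option)
        del remaining[best_option]
        # Renormalize the remaining mass so the funnel narrows each round.
        total = sum(remaining.values()) + abstain_prob
        if total > 0:
            remaining = {k: v / total for k, v in remaining.items()}
            abstain_prob /= total

    return selected or ["(abstain)"]


if __name__ == "__main__":
    probs = {"A": 0.40, "B": 0.30, "C": 0.05, "D": 0.02}
    # Selects A and B, then stops once abstention dominates option C.
    print(choice_funnel(probs, abstain_prob=0.23))
```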
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 14547