None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks
Abstract: In LLM evaluations, a common strategy to probe cognitive abilities beyond simple recall or memorization is to introduce variations into multiple-choice questions, often by altering numbers in math tasks. In contrast, we propose a general variation method that fully dissociates the correct answer from any previously seen tokens or concepts, encouraging reasoning over memorization. Using this method, we evaluate state-of-the-art proprietary and open-source LLMs on two datasets in English and Spanish: the public \textit{MMLU} benchmark and the private \textit{[anonymous dataset name]}. All models show substantial accuracy drops under our variation, averaging 56\% on MMLU and 51\% on [anonymous dataset], with losses ranging from 10\% to 93\%. Notably, the most accurate model (OpenAI-o3-mini) is not the most robust (DeepSeek-R1-70B), suggesting that top performance on standard benchmarks does not necessarily reflect stronger reasoning abilities. We also observe larger drops on public datasets and on questions in their original language (vs. manual translations), pointing to contamination and the role of memorization in current LLMs’ performance.
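The abstract does not spell out the exact transformation, so the following is only a minimal sketch of one plausible reading of the title and abstract: the correct option is replaced with a "none of the other answers" option, so the right choice no longer shares tokens with anything a model may have memorized. The `Question` dataclass, the `apply_none_of_the_others` helper, and the example item are illustrative, not the authors' released code.

```python
# Minimal sketch (assumed, not the authors' implementation) of a
# "none of the others" variation for multiple-choice items.
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Question:
    stem: str            # question text
    options: List[str]   # answer options, order preserved
    answer_index: int    # index of the correct option

NONE_OPTION = "None of the other answers"


def apply_none_of_the_others(q: Question) -> Question:
    """Return a variant where the original correct option is replaced by a
    'none of the others' option. The correct index is unchanged, but the
    correct *text* no longer matches any previously seen answer string."""
    new_options = list(q.options)
    new_options[q.answer_index] = NONE_OPTION
    return replace(q, options=new_options)


if __name__ == "__main__":
    original = Question(
        stem="What is the capital of Spain?",
        options=["Lisbon", "Madrid", "Paris", "Rome"],
        answer_index=1,
    )
    variant = apply_none_of_the_others(original)
    for label, opt in zip("ABCD", variant.options):
        print(f"{label}. {opt}")  # B is now 'None of the other answers'
```

Under this reading, a model can only answer the variant correctly by verifying that every remaining option is wrong, which is the reasoning-over-recall behavior the abstract describes probing.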
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, automatic evaluation of datasets, evaluation methodologies, evaluation
Contribution Types: Model analysis & interpretability
Languages Studied: English, Spanish
Previous URL: https://openreview.net/forum?id=7uM4hM6KBz
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Justification For Not Keeping Action Editor Or Reviewers: We respectfully request new reviewers for this phase, as two out of three reviewers in the previous round focused on peripheral aspects that fall outside the scope of our study (such as the use of multiple-choice questions, which is standard in LLM evaluations, or suggestions to test more datasets, which we acknowledge but deliberately postponed in favor of depth). We do not claim our method is a definitive measure of reasoning; rather, we frame it as a simple, dataset-agnostic proxy that helps identify memorization effects and enables more informative evaluations in saturated benchmarks. We believe this contribution complements existing work and deserves assessment with this framing in mind.
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 4
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: 4.1
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 4.2
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 4.2
C3 Descriptive Statistics: Yes
C3 Elaboration: 5
C4 Parameters For Packages: Yes
C4 Elaboration: 4
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: ChatGPT was used to improve writing and obtain code suggestions.
Author Submission Checklist: yes
Submission Number: 706