None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks
Abstract: In LLM evaluations, a common strategy to probe cognitive abilities beyond simple recall or memorization is to introduce variations into multiple-choice questions, often by altering numbers in math tasks. In contrast, we propose a general variation method that fully dissociates the correct answer from any previously seen tokens or concepts, encouraging reasoning over memorization. Using this method, we evaluate state-of-the-art proprietary and open-source LLMs on two datasets in English and Spanish: the public \textit{MMLU} benchmark and the private \textit{[anonymous dataset name]}. All models show substantial accuracy drops under our variation, averaging 56\% on MMLU and 51\% on [anonymous dataset], with losses ranging from 10\% to 93\%. Notably, the most accurate model (OpenAI-o3-mini) is not the most robust (DeepSeek-R1-70B), suggesting that top performance on standard benchmarks does not necessarily reflect stronger reasoning abilities. We also observe larger drops on public datasets and on questions in their original language (vs. manual translations), pointing to contamination and the role of memorization in current LLMs’ performance.
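The abstract does not spell out the exact transformation, so the following is only a minimal sketch of one plausible reading of the title and abstract: the correct option is replaced with a "none of the other answers" option, so the right choice no longer shares tokens with anything a model may have memorized. The `Question` dataclass, the `apply_none_of_the_others` helper, and the example item are illustrative, not the authors' released code.

```python
# Minimal sketch (assumed, not the authors' implementation) of a
# "none of the others" variation for multiple-choice items.
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Question:
    stem: str            # question text
    options: List[str]   # answer options, order preserved
    answer_index: int    # index of the correct option

NONE_OPTION = "None of the other answers"


def apply_none_of_the_others(q: Question) -> Question:
    """Return a variant where the original correct option is replaced by a
    'none of the others' option. The correct index is unchanged, but the
    correct *text* no longer matches any previously seen answer string."""
    new_options = list(q.options)
    new_options[q.answer_index] = NONE_OPTION
    return replace(q, options=new_options)


if __name__ == "__main__":
    original = Question(
        stem="What is the capital of Spain?",
        options=["Lisbon", "Madrid", "Paris", "Rome"],
        answer_index=1,
    )
    variant = apply_none_of_the_others(original)
    for label, opt in zip("ABCD", variant.options):
        print(f"{label}. {opt}")  # B is now 'None of the other answers'
```

Under this reading, a model can only answer the variant correctly by verifying that every remaining option is wrong, which is the reasoning-over-recall behavior the abstract describes probing.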
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, automatic evaluation of datasets, evaluation methodologies, evaluation
Contribution Types: Model analysis & interpretability
Languages Studied: English, Spanish
Previous URL: https://openreview.net/forum?id=7uM4hM6KBz
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Justification For Not Keeping Action Editor Or Reviewers: We respectfully request new reviewers for this phase, as two out of three reviewers in the previous round focused on peripheral aspects that fall outside the scope of our study (such as the use of multiple-choice questions, which is standard in LLM evaluations, or suggestions to test more datasets, which we acknowledge but deliberately postponed in favor of depth). We do not claim our method is a definitive measure of reasoning; rather, we frame it as a simple, dataset-agnostic proxy that helps identify memorization effects and enables more informative evaluations in saturated benchmarks. We believe this contribution complements existing work and deserves assessment with this framing in mind.
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 4
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: 4.1
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 4.2
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 4.2
C3 Descriptive Statistics: Yes
C3 Elaboration: 5
C4 Parameters For Packages: Yes
C4 Elaboration: 4
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: ChatGPT was used to improve writing and obtain code suggestions.
Author Submission Checklist: yes
Submission Number: 706