On the Limits of LLM Reasoning: Evidence From Contamination, Translation, and Answer Modification in Multiple-Choice Benchmarks

Eva Sánchez-Salido, Julio Gonzalo, Guillermo Marco

Published: 2026, Last Modified: 27 May 2026, IEEE Access 2026, CC BY-SA 4.0
Abstract: Multiple-choice benchmarks are widely used to assess LLMs, yet their accuracy scores often conflate memorization (understood as pattern-based recall) with genuine reasoning (inference beyond surface pattern transfer), especially when test sets are public and prone to contamination. To disentangle these effects, we evaluate models under three experimental conditions: 1) public (MMLU) vs. private (UNED-Access) data; 2) original vs. professionally translated questions (English/Spanish; translations are less likely to appear verbatim in training data); and 3) an answer modification that replaces the correct option with “None of the other answers” (NOTO), which becomes the right choice and dissociates success from previously seen tokens or concepts, requiring implicit inference steps. Across 16 proprietary and open-weights models, accuracy drops under answer modification are substantial (10%–93%), with larger declines on the public dataset (56% on MMLU vs. 51% on UNED-Access) and minimal differences between originals and translations. Taken together, contamination and translation emerge as second-order factors compared to the NOTO condition, suggesting that current LLMs generalize well across datasets and languages but show marked limitations when inference is required. Model size and baseline accuracy prove insufficient to predict robustness, although in low-contamination settings accuracy becomes a more reliable indicator of inference-based behavior. Instead, training strategies explicitly targeting reasoning emerge as the primary drivers of robustness, with reasoning-oriented models consistently showing greater stability under the NOTO substitution.
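The answer modification in condition 3 can be illustrated with a minimal sketch. This is not the authors' code: the `MCQuestion` type, the `apply_noto` function, and the choice to keep the substituted option in the correct answer's original position are all assumptions made for illustration, and the abstract does not specify where the new option is placed.

```python
from dataclasses import dataclass, replace

# Replacement text taken from the abstract's description of the modification.
NOTO_TEXT = "None of the other answers"


@dataclass(frozen=True)
class MCQuestion:
    """Hypothetical container for one multiple-choice item."""
    stem: str
    options: tuple[str, ...]
    answer_index: int  # position of the correct option


def apply_noto(q: MCQuestion) -> MCQuestion:
    """Replace the correct option's text with NOTO_TEXT.

    Assumption: the substituted option stays in the same position, so
    answer_index is unchanged and the substituted option is now correct.
    """
    if not 0 <= q.answer_index < len(q.options):
        raise ValueError("answer_index out of range")
    new_options = tuple(
        NOTO_TEXT if i == q.answer_index else opt
        for i, opt in enumerate(q.options)
    )
    return replace(q, options=new_options)


# Toy item (invented for illustration, not from MMLU or UNED-Access).
example = MCQuestion(
    stem="What is 2 + 2?",
    options=("3", "4", "5", "6"),
    answer_index=1,
)
modified = apply_noto(example)
```

After the substitution, the original correct text no longer appears among the options, so a model can only answer correctly by ruling out every remaining distractor rather than by recognizing a previously seen answer string.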