Lost in Real-World Scenarios: Concretization Disrupts LLM Logical Reasoning

ICLR 2026 Conference Submission 25024 Authors

20 Sept 2025 (modified: 08 Oct 2025) · License: CC BY 4.0
Keywords: Large Language Models, Reasoning Robustness, Input Formulation, Logical Reasoning
Abstract: Although large reasoning models have attracted significant attention, recent studies reveal that even minor variations in input formulation can lead to substantial inconsistencies in reasoning outcomes, underscoring their fragility in real-world scenarios. To systematically investigate this issue, we propose a concretization framework that automatically translates clean reasoning logic into concrete contexts with challenging formulations. In this framework, two translators are trained via a dual-learning approach. The first converts formal language templates into natural language puzzles, guided by a difficulty-aware reward that promotes the exploration of harder formulations. The second translates puzzles back into templates, with isomorphism verification ensuring the consistency of underlying reasoning logic. Applying this framework, we construct extensive paired datasets of formal language templates and natural language puzzles. Through evaluation, we observe a sharp decline in LLM reasoning performance when shifting from formal templates to natural language puzzles. To uncover the underlying causes, we conduct an in-depth analysis of how tokens derived from formal templates and natural language puzzles influence the final answers. This analysis reveals two primary sources of degradation: dispersed reasoning attention across non-essential tokens and conflicts introduced by alternative formulations. To address these issues, we propose a prompt-based approach that instructs LLMs to abstract reasoning logic from concrete contexts before attempting direct solutions, and a training-based approach that further strengthens LLMs' abstraction ability. Experimental results show that our methods improve LLM performance on natural language puzzles by up to 56.2%, nearly eliminating the performance loss induced by concretization.
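The prompt-based approach lends itself to a compact illustration. The Python sketch below shows one plausible way to wrap a puzzle in an abstraction-first instruction before querying a model; the prompt wording and the `query_llm` helper are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the abstract-then-solve prompting strategy described in
# the abstract. `query_llm` is a hypothetical stand-in for any chat-completion
# API; the instruction text below is an assumption, not the paper's prompt.

ABSTRACT_THEN_SOLVE = (
    "First, strip away the story: restate the puzzle as a formal logic "
    "template, listing only the variables, the constraints, and the question. "
    "Then solve the formal template step by step and report the answer."
)

def query_llm(prompt: str) -> str:
    """Placeholder for an actual LLM API call (e.g., a chat-completion endpoint)."""
    raise NotImplementedError

def solve_puzzle(puzzle: str) -> str:
    # Prepend the abstraction instruction so the model extracts the
    # underlying reasoning logic before attempting a direct solution.
    return query_llm(f"{ABSTRACT_THEN_SOLVE}\n\nPuzzle:\n{puzzle}")
```

The design intuition, per the paper's analysis, is that forcing an explicit abstraction step concentrates the model's attention on the essential logical structure rather than on non-essential narrative tokens.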
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 25024