How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark

ACL ARR 2025 May Submission 1678 Authors

18 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: We introduce Grade School Math with Distracting Context (GSM-DC), a synthetic benchmark for evaluating the reasoning robustness of Large Language Models (LLMs) against systematically controlled irrelevant context (IC). GSM-DC constructs symbolic reasoning graphs with precise distractor injection, enabling rigorous, reproducible evaluation. Our experiments demonstrate that LLMs are highly sensitive to IC, which degrades both reasoning path selection and arithmetic accuracy. Additionally, training models with strong distractors improves performance in both in-distribution and out-of-distribution scenarios. We further propose a stepwise tree search guided by a process reward model, which notably enhances robustness in out-of-distribution conditions.
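The abstract's stepwise tree search guided by a process reward model (PRM) can be sketched as a beam-style search that scores partial reasoning paths step by step. The sketch below is illustrative only: the step proposer and PRM here are toy placeholders (`propose_steps`, `prm_score`), not the paper's actual models.

```python
# Hedged sketch of PRM-guided stepwise tree search.
# All functions and scoring rules are illustrative assumptions,
# not the submission's implementation.

def propose_steps(partial_path):
    # Toy step proposer: each partial path branches into two candidate
    # next reasoning steps (an LLM would generate these in practice).
    return [partial_path + [f"step{len(partial_path)}-{i}"] for i in range(2)]

def prm_score(path):
    # Toy process reward model: count steps ending in "-0" (a stand-in
    # for a learned model scoring each step's correctness).
    return sum(1 for step in path if step.endswith("-0"))

def stepwise_tree_search(depth=3, beam=2):
    """Expand all frontier paths, keep the top-`beam` by PRM score each step."""
    frontier = [[]]
    for _ in range(depth):
        candidates = [p for path in frontier for p in propose_steps(path)]
        candidates.sort(key=prm_score, reverse=True)
        frontier = candidates[:beam]
    return frontier[0]

best = stepwise_tree_search()
print(best)  # → ['step0-0', 'step1-0', 'step2-0']
```

At each depth, the search prunes reasoning branches the PRM rates poorly, which is how such a search can discard paths derailed by distracting context before they reach a final answer.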
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: robustness, explanation faithfulness, hardness of samples, adversarial training
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Data resources, Data analysis
Languages Studied: English
Submission Number: 1678