RIDE: Difficulty Evolving Perturbation with Item Response Theory for Mathematical Reasoning

Xinyuan Li; Murong Xu; Wenbiao Tao; Hanlun Zhu; Yike Zhao; Jipeng Zhang; Yunshi Lan

04 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0

Keywords: Robust Evaluation; Item Response Theory; Reinforcement Learning; Mathematical Reasoning

TL;DR: Because LLM math performance can be inflated by data leakage and pattern matching, we introduce RIDE, an IRT-guided question-rewriting framework that perturbs competition problems to increase their difficulty, leading to performance drops across many LLMs.

Abstract: Large language models (LLMs) achieve high performance on mathematical reasoning benchmarks, but these results can be inflated by training data leakage or superficial pattern matching rather than genuine reasoning. Measuring true mathematical reasoning ability therefore requires adversarial perturbation-based evaluation. Current rule-based perturbation methods often generate ill-posed questions, and they impede both the systematic evaluation of question difficulty and the evolution of benchmarks. To bridge this gap, we propose RIDE, a novel adversarial question-rewriting framework that leverages Item Response Theory (IRT) to rigorously measure question difficulty and to generate intrinsically more challenging, well-posed variations of mathematical problems. We employ $35$ LLMs to simulate students and build a difficulty ranker from their responses. This ranker provides a reward signal during reinforcement learning and guides a question-rewriting model to reformulate existing questions across difficulty levels. Applying RIDE to competition-level mathematical benchmarks yields perturbed versions that degrade the performance of advanced LLMs: experiments show an average drop of $21.73\%$ across $26$ models, exposing limited robustness in mathematical reasoning and confirming the validity of our evaluation approach.
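
To make the IRT step concrete, the sketch below shows one way question difficulty could be estimated from a binary response matrix of simulated students. The abstract does not say which IRT model RIDE uses, so this assumes the one-parameter (Rasch) model, where the probability of a correct answer is the logistic function of ability minus difficulty, fitted by gradient ascent on an L2-penalized log-likelihood. The function name `fit_rasch`, its parameters, and the toy response matrix are hypothetical illustrations, not taken from the paper.

```python
import math


def sigmoid(x):
    """Logistic function used by the Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-x))


def fit_rasch(responses, steps=500, lr=0.05, l2=0.01):
    """Estimate abilities and difficulties under an assumed Rasch model.

    responses[s][q] is 1 if simulated student s answered question q
    correctly, else 0. The small L2 penalty keeps estimates finite when
    a student or question has an all-correct or all-wrong record.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    ability = [0.0] * n_students
    difficulty = [0.0] * n_items
    for _ in range(steps):
        # Start each gradient from the L2 penalty term.
        grad_a = [-l2 * a for a in ability]
        grad_d = [-l2 * d for d in difficulty]
        for s in range(n_students):
            for q in range(n_items):
                p = sigmoid(ability[s] - difficulty[q])
                resid = responses[s][q] - p
                grad_a[s] += resid  # correct answers raise ability
                grad_d[q] -= resid  # correct answers lower difficulty
        ability = [a + lr * g for a, g in zip(ability, grad_a)]
        difficulty = [d + lr * g for d, g in zip(difficulty, grad_d)]
    return ability, difficulty


# Toy data (hypothetical): 4 simulated students, 3 questions.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
]
ability, difficulty = fit_rasch(responses)

# Rank question indices from easiest to hardest by estimated difficulty.
ranking = sorted(range(len(difficulty)), key=lambda q: difficulty[q])
print(ranking)
```

In a pipeline like the one the abstract describes, difficulty estimates of this kind could be used to order or score questions for the ranker. How RIDE actually builds its ranker and reward signal is not specified here.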

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 1977
