Self-Healing: Recovering Pruned Large Reasoning Models via Reinforcement Learning

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: pruning, large reasoning models, post training, math reasoning
Abstract: As large reasoning models (LRMs) achieve breakthroughs on reasoning tasks, building lightweight and efficient LRMs has become an urgent need for real-world applications. While structured pruning improves efficiency by reducing parameters, it often causes significant performance degradation. To mitigate this loss, existing methods typically rely on next-token prediction, especially supervised fine-tuning (SFT), for recovery training. However, the effectiveness of pruning and recovery training for LRMs remains underexplored. Our empirical study shows that structured pruning degrades the mathematical reasoning ability of LRMs but does not destroy it completely, leaving room for compensation through recovery training. Existing recovery methods merely imitate the reasoning trajectories in the training data, leading to performance bottlenecks and low data efficiency. To address this, we introduce reinforcement learning with verifiable rewards (RLVR) for recovery training, enabling pruned LRMs to self-heal. Experiments on five representative LRMs across six mathematical reasoning benchmarks show that RLVR significantly outperforms SFT-based recovery training. At 25\% compression, RLVR-based recovery training improves performance from around 80\% (with SFT) to over 95\%, approaching or even surpassing the accuracy of unpruned LRMs while maintaining efficiency.
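To make the recovery signal concrete: RLVR trains the pruned model against a reward that can be checked automatically, for example whether the final answer of a generated solution matches the reference answer of a math problem. The Python snippet below is a minimal, illustrative sketch of such a verifiable reward, not the authors' implementation; the function names (extract_boxed_answer, verifiable_reward) and the \boxed{} answer convention are assumptions made for illustration, and in practice this reward would be fed to an RL algorithm such as PPO or GRPO during recovery training.

import re

def extract_boxed_answer(completion: str):
    # Pull the last \boxed{...} answer out of a model completion, if any.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, reference_answer: str) -> float:
    # Binary reward: 1.0 if the extracted final answer matches the reference, else 0.0.
    predicted = extract_boxed_answer(completion)
    if predicted is None:
        return 0.0
    return 1.0 if predicted == reference_answer.strip() else 0.0

if __name__ == "__main__":
    completion = "... so the total is \\boxed{42}."
    print(verifiable_reward(completion, "42"))  # prints 1.0

Because the reward only checks the verifiable outcome rather than matching a reference reasoning trajectory token by token (as SFT does), the pruned model is free to discover its own reasoning paths, which is the intuition behind the reported gains over SFT-based recovery.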
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 12375