Keywords: compression, pruning, large language models, instruction following, large reasoning models, efficiency
TL;DR: We study pruning in both instruction-following and reasoning-augmented LLMs, revealing how pruning methods impact efficiency and performance, and offering guidelines for pruning in the reasoning era.
Abstract: Model pruning is a widely used technique for reducing the substantial computational cost of large language models (LLMs). However, existing research suffers from two key limitations: (1) pruning is typically evaluated post hoc on datasets unrelated to the original training corpus, leaving it unclear whether the model's general capabilities are preserved; and (2) it has focused almost exclusively on standard instruction-following models ($\textbf{LLM-instruct}$). The recent rise of reasoning-augmented models ($\textbf{LLM-think}$), which generate explicit chain-of-thought steps, presents an unstudied challenge for established pruning methods due to their substantially different generation patterns.
In this work, we conduct the first systematic investigation of pruning across both LLM-instruct and LLM-think families. We introduce a rigorous experimental framework that leverages the models' original training corpora for both pruning calibration and post-pruning recovery, enabling a more faithful assessment of performance preservation than prior work. Across a comprehensive suite of static and dynamic pruning methods evaluated on 17 diverse tasks, we find that the effectiveness of pruning strategies differs significantly between the two model families. Our results reveal that techniques optimized for concise instruction-following do not seamlessly transfer to preserving complex, multi-step reasoning. This work provides critical insights and practical guidelines for efficiently compressing the next generation of reasoning-augmented LLMs.
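To make the calibration step concrete, the snippet below is a minimal, illustrative sketch of activation-aware unstructured pruning of a single linear layer (in the spirit of |W| * ||X|| scoring on calibration data). The layer size, calibration batch, and 50% sparsity target are assumptions for illustration, not the paper's actual models, data, or methods.

```python
# Minimal sketch: calibration-based unstructured pruning of one linear layer.
# Hypothetical setup; not the paper's actual configuration.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-ins for an LLM projection layer and a slice of calibration text
# drawn from the original training corpus.
layer = nn.Linear(in_features=64, out_features=64, bias=False)
calib_inputs = torch.randn(256, 64)  # 256 calibration tokens, hidden size 64

# Score each weight by |w_ij| * ||x_j||_2 over the calibration activations.
act_norm = calib_inputs.norm(p=2, dim=0)          # per-input-feature L2 norm, shape (64,)
scores = layer.weight.detach().abs() * act_norm   # elementwise scores, shape (64, 64)

# Zero out the lowest-scoring 50% of weights within each output row.
sparsity = 0.5
k = int(scores.shape[1] * sparsity)
threshold = torch.kthvalue(scores, k, dim=1, keepdim=True).values
mask = (scores > threshold).float()

with torch.no_grad():
    layer.weight.mul_(mask)

print(f"fraction of weights kept: {mask.mean().item():.2f}")
```

A static method would fix this mask once after calibration, whereas dynamic methods recompute which weights (or tokens) are active at inference time; the sketch only shows the static case.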
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19558