Reasoning Models Can Be Accurately Pruned via Chain-of-Thought Reconstruction

Published: 16 Oct 2025, Last Modified: 10 Nov 2025
Venue: NeurIPS 2025 ER Workshop
License: CC BY 4.0
Keywords: Reasoning LLMs, Pruning, LLM Compression
TL;DR: Standard pruning causes large accuracy losses in reasoning LLMs. Reconstructing chain-of-thought traces during calibration preserves performance, maintaining ~95% accuracy even at 50% sparsity.
Abstract: Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces at inference time, which makes them costly to deploy at scale. Model pruning is a compression technique that removes weights from a model to reduce its size while preserving accuracy, yielding more efficient models. We show that standard pruning methods cause large accuracy drops on reasoning LLMs. To mitigate this, we introduce a simple, drop-in fix: during pruning, we jointly reconstruct activations from the input and the model’s own chain-of-thought traces via a layer-wise objective. This “Reasoning-Aware Compression” (RAC) integrates seamlessly into standard pruning methods such as SparseGPT and boosts their performance significantly. Under RAC, reasoning LLMs can be pruned to 50\% sparsity while retaining up to 95\% of the original model's accuracy on math and coding tasks.
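To make the layer-wise idea concrete, below is a minimal sketch (not the authors' code) of how calibration data augmented with the model's own chain-of-thought traces could feed a layer-wise reconstruction objective. The helper names, the Hugging Face-style `model.generate`/`tokenizer` interface, and the simple squared-error objective are assumptions for illustration; the paper's actual method plugs into a SparseGPT-style solver rather than this direct loss.

```python
# Hypothetical sketch of Reasoning-Aware Compression (RAC) calibration:
# activations are collected from the prompt *and* the model's own
# chain-of-thought continuation, and each layer is pruned so that its
# output on those activations matches the dense layer's output.

import torch


def build_rac_calibration_batch(model, tokenizer, prompts, max_new_tokens=512):
    """Concatenate each prompt with the model's own CoT trace (HF-style API assumed)."""
    batch = []
    for prompt in prompts:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        # Let the dense model produce its chain-of-thought continuation.
        trace = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
        batch.append(trace)  # prompt tokens followed by CoT tokens
    return batch


def layerwise_reconstruction_loss(dense_layer, pruned_layer, calib_inputs):
    """Sum of ||W x - W_hat x||^2 over calibration activations x (prompt + CoT)."""
    loss = torch.zeros(())
    for x in calib_inputs:              # x: (seq_len, hidden_dim) activations
        with torch.no_grad():
            ref = dense_layer(x)        # dense layer's output is the target
        loss = loss + torch.sum((pruned_layer(x) - ref) ** 2)
    return loss
```

In this reading, the only change from standard calibration is that the activations passed to the per-layer pruning objective come from prompt-plus-trace sequences instead of prompts alone, which is why the approach drops into existing pipelines such as SparseGPT.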
Submission Number: 36