Reasoning Models Can Be Accurately Pruned via Chain-of-Thought Reconstruction

Published: 26 Jan 2026 · Last Modified: 02 May 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: LLM Compression, Pruning, Reasoning, Chain-of-Thought
TL;DR: Standard pruning causes large accuracy losses in reasoning LLMs. Reconstructing chain-of-thought traces during calibration preserves performance, maintaining ~95% accuracy even at 50% sparsity.
Abstract: Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces at inference time, which makes them costly to deploy at scale. We show that compression techniques such as neural network pruning cause greater performance loss on reasoning than on typical language modeling tasks, and in some cases can even make the model slower, because the pruned model produces more thinking tokens at lower quality. We show that this is partly because standard LLM pruning methods focus on reconstructing activations over the input, whereas reasoning is a decode-dominated task. We introduce a simple, drop-in fix: during pruning, we jointly reconstruct activations from the input and from the model's on-policy chain-of-thought traces. This "Reasoning-Aware Compression" (RAC) integrates seamlessly into existing pruning workflows such as SparseGPT and boosts their performance significantly. Code can be found at: https://github.com/RyanLucas3/Reasoning-Aware-Compression
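At a high level, the RAC recipe changes what the pruner calibrates on: instead of input text alone, each calibration sequence is a prompt concatenated with a chain-of-thought trace sampled from the dense model itself, so the pruned weights are fit to the decode-time activation distribution. Below is a minimal sketch of this calibration step, assuming Hugging Face `transformers` for generation; the checkpoint name is an illustrative example, and `sparsegpt_prune` is a hypothetical stand-in for an actual SparseGPT-style pruning routine (see the linked repository for the authors' implementation).

```python
# Sketch of Reasoning-Aware Compression (RAC) calibration data construction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed example checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

def build_rac_calibration(prompts, max_new_tokens=2048):
    """Build calibration sequences containing both the input prompt and the
    model's own (on-policy) chain-of-thought continuation."""
    sequences = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        # Sample an on-policy trace: the dense model decodes its own CoT,
        # so calibration activations match the decode-time distribution
        # rather than the input-only distribution used by standard pruning.
        trace = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=True
        )
        sequences.append(trace[0])  # prompt tokens + generated CoT tokens
    return sequences

calib = build_rac_calibration(
    ["Prove that the sum of two odd numbers is even. Think step by step."]
)
# sparsegpt_prune(model, calib, sparsity=0.5)  # hypothetical pruning call
```

Because only the calibration data changes, this slots into existing one-shot pruners without modifying the weight-update rule itself.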
Primary Area: optimization
Submission Number: 11449