A Simple Linear Patch Revives Layer-Pruned Large Language Models

ACL ARR 2025 February Submission 8132 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Layer pruning has become a popular technique for compressing large language models (LLMs) due to its simplicity. However, existing layer pruning methods often suffer from significant performance drops. We identify that \textit{this degradation stems from a mismatch of activation magnitudes across layers and tokens at the pruning interface}. To address this, we propose \textsc{LinearPatch}, a simple plug-and-play technique to revive layer-pruned LLMs. Our method introduces a Hadamard transformation to suppress massive outliers on particular tokens, together with channel-wise scaling to align activation magnitudes. By the spectral theorem, these two operations can be fused into a single real symmetric matrix, i.e., the proposed \textsc{LinearPatch}, which is inserted at the pruning interface with negligible inference overhead. Our experiments demonstrate that \textsc{LinearPatch} retains up to \textbf{94.15\%} of the original model's performance when pruning 5 layers of LLaMA-3-8B on the question answering benchmark, surpassing the existing state of the art by \textbf{4\%}. Additionally, with the proposed offline knowledge distillation using only 5K samples, \textsc{LinearPatch} can be further boosted to \textbf{95.16\%} within 30 minutes on a single computing card. Code will be released upon acceptance.
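
The abstract only sketches the construction, so the following is a minimal illustrative sketch rather than the authors' implementation: it assumes the patch takes the form P = H diag(s) H^T, where H is an orthonormal Hadamard matrix and s is a per-channel scale estimated from a few calibration activations on either side of the pruning interface. The function name build_linear_patch, the toy calibration data, and the way s is estimated are all assumptions introduced for illustration.

```python
import numpy as np
from scipy.linalg import hadamard  # Sylvester-construction Hadamard matrices (size must be a power of 2)


def build_linear_patch(scale: np.ndarray) -> np.ndarray:
    """Hypothetical LinearPatch-style matrix: fuse a Hadamard rotation and
    channel-wise scaling into one matrix P = H @ diag(scale) @ H.T."""
    d = scale.shape[0]
    H = hadamard(d) / np.sqrt(d)          # orthonormal Hadamard: H @ H.T == I
    # Real symmetric by construction: (H @ D @ H.T).T == H @ D @ H.T since D is diagonal.
    return H @ np.diag(scale) @ H.T


# Toy usage: pretend the pruned layers rescaled activations, and patch the gap.
rng = np.random.default_rng(0)
d = 8
x_in = rng.normal(size=(16, d))           # calibration activations entering the pruned block
x_out = 1.7 * x_in                        # activations the retained layers expect (toy case)

H = hadamard(d) / np.sqrt(d)
# Per-channel scale estimated in the Hadamard-rotated space (an assumption made for this sketch).
scale = np.linalg.norm(x_out @ H.T, axis=0) / np.linalg.norm(x_in @ H.T, axis=0)

P = build_linear_patch(scale)
x_patched = x_in @ P                      # one extra matmul at the pruning interface
print(np.allclose(x_patched, x_out))      # True for this toy rescaling
```

Since P is a single d x d matrix, it could in principle be folded into the input projection of the first retained layer, which would keep the inference overhead negligible, consistent with the claim in the abstract.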
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: pruning, NLP in resource-constrained settings
Contribution Types: Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 8132