Keywords: Data-free model compression, Redundant attention layers
TL;DR: In LLMs, attention layers in the later stages of the network tend to be redundant, and we propose a data-free method for removing them.
Abstract: Many self-attention sublayers in large language models (LLMs) can be removed with little to no loss. We attribute this to the Attention Suppression Hypothesis: during pre-training, some deep attention layers learn to mute their own contribution, leaving the residual stream and the MLP to carry the representation.
We propose Gate-Norm, a one-shot, weight-only criterion that ranks attention sublayers by query--key coupling and removes the least coupled ones, requiring no calibration data, no forward passes, no fine-tuning, and no specialized kernels. On 40-layer, 13B-parameter LLaMA models, Gate-Norm prunes the model in under a second. Pruning $8$–$16$ attention sublayers yields up to $1.30\times$ higher inference throughput while keeping average zero-shot accuracy within $2\%$ of the unpruned baseline across BoolQ, RTE, HellaSwag, WinoGrande, ARC-Easy/Challenge, and OpenBookQA.
Across these settings, Gate-Norm matches data-driven pruning methods in accuracy while being $\sim\ 1000\times$ faster to score layers, enabling practical, data-free compression of LLMs.
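To make the weight-only scoring concrete, below is a minimal sketch of a data-free layer-scoring pass in the spirit of Gate-Norm. The abstract does not specify the exact coupling score, so this sketch assumes it is the Frobenius norm of $W_Q W_K^\top$ per layer (a proxy for query--key coupling) and assumes LLaMA-style module names (`model.model.layers[i].self_attn.q_proj`); both are illustrative assumptions, not the authors' definition.

```python
# Hypothetical sketch: rank attention sublayers using weights only (no data,
# no forward passes). The coupling score and module paths below are assumed.
import torch

@torch.no_grad()
def score_attention_layers(model):
    """Return {layer_index: coupling_score} computed from weights alone."""
    scores = {}
    for i, layer in enumerate(model.model.layers):  # LLaMA-style layout (assumed)
        wq = layer.self_attn.q_proj.weight.float()
        wk = layer.self_attn.k_proj.weight.float()
        # Assumed proxy for query--key coupling: ||W_Q W_K^T||_F.
        scores[i] = torch.linalg.matrix_norm(wq @ wk.T, ord="fro").item()
    return scores

def layers_to_prune(scores, num_to_remove=8):
    """Lowest-coupling attention sublayers are candidates for removal."""
    return sorted(scores, key=scores.get)[:num_to_remove]
```

Because the score depends only on the checkpoint's weight matrices, it can be computed in a single pass over the layers, which is consistent with the claimed sub-second, calibration-free pruning.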
Primary Area: optimization
Submission Number: 5233