Activation‑Aware Pruning of Large Language Models

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Model Pruning; Activation‑Aware Regularization; Large Language Models (LLMs); Sparsity; Efficient Inference
TL;DR: We propose a novel compression framework that explicitly accounts for how weight importance shifts between the pre- and post-activation stages of Transformer modules.
Abstract: Although large language models (LLMs) have performed well across a wide range of tasks since their emergence, their deployment in many practical scenarios is hindered by limited computational resources. One-shot pruning mitigates this issue by removing redundant parameters from the weight matrices in a single pass, without additional training. However, most existing approaches still rely on heuristic searches or linear approximations inherited from earlier deep networks, assigning equal importance to all weight matrices while overlooking the activation-function modules in Transformer architectures, which alter the relative significance of weights before and after activation. In this paper, we propose a novel pruning method, Activation-Aware Pruning (AAP), which improves compression performance by explicitly capturing the shifts induced by activation. Beyond matching pre-activation outputs, AAP incorporates an activation-aware regularizer that preserves post-activation sign and pattern consistency, substantially reducing accuracy degradation at high sparsity levels. Moreover, we derive an approximate update rule based on an analytical approximation of the weight matrix, which requires no fine-tuning and is supported by theoretical guarantees. Applied to open-source models such as the OPT family and the LLaMA series, our method achieves lower perplexity across a range of sparsity levels than prior approaches. The code will be released on GitHub.
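To make the idea concrete, below is a minimal, hypothetical sketch of an activation-aware one-shot pruning step in PyTorch. It combines a magnitude-times-activation-norm importance score (in the spirit of existing activation-aware criteria) with an assumed post-activation sign-consistency term; the function name, the specific score, and the penalty weight `lam` are illustrative assumptions, not the exact AAP regularizer or update rule described in the paper.

```python
# Hedged sketch of activation-aware one-shot pruning; not the paper's algorithm.
import torch
import torch.nn.functional as F

def activation_aware_prune(W: torch.Tensor,
                           X: torch.Tensor,
                           act=F.gelu,
                           sparsity: float = 0.5,
                           lam: float = 1.0) -> torch.Tensor:
    """Return a pruned copy of W ([out, in]) using calibration inputs X ([n, in]).

    Importance combines (i) a magnitude-times-activation-norm score and
    (ii) an assumed penalty favoring weights that keep the post-activation
    sign pattern unchanged. Both terms are illustrative stand-ins.
    """
    # (i) pre-activation importance: |w_ij| * ||x_j||_2 over the calibration set
    col_norm = X.norm(p=2, dim=0)                       # [in]
    base_score = W.abs() * col_norm.unsqueeze(0)        # [out, in]

    # (ii) post-activation sign consistency (assumed form): reward weights whose
    # contribution agrees with the sign of the post-activation output.
    pre = X @ W.t()                                     # [n, out]
    post_sign = torch.sign(act(pre))                    # [n, out]
    contrib = torch.einsum('no,nj->oj', post_sign, X) * W   # [out, in]
    sign_score = contrib.clamp(min=0)

    score = base_score + lam * sign_score

    # keep the top-(1 - sparsity) fraction of weights in each output row
    k = max(1, int(W.shape[1] * (1.0 - sparsity)))
    thresh = torch.topk(score, k, dim=1).values[:, -1:]      # [out, 1]
    mask = score >= thresh
    return W * mask
```

As a usage sketch, `pruned = activation_aware_prune(layer.weight.data, calib_batch, sparsity=0.5)` would produce a 50%-sparse copy of one linear layer's weights from a small calibration batch; the real method additionally applies an analytical weight update rather than simple masking.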
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 13003