Pruning Large Language Models to Intra-module Low-rank Structure with Transitional Activations

Anonymous

16 Feb 2024
ACL ARR 2024 February Blind Submission
Readers: Everyone
Abstract: Structured pruning offers a viable approach to the local deployment of large language models (LLMs) by reducing computational and memory overheads. Compared to unstructured pruning and quantization, structured pruning has the advantage of being recoverable, since the pruned model remains dense and high-precision rather than sparse or low-precision. However, achieving a high compression ratio for scaled-up LLMs remains a challenge, as coarse-grained structured pruning inflicts substantial damage on the highly interconnected model. In this paper, we introduce TransAct, a task-agnostic structured pruning approach coupled with a compact architecture design. TransAct reduces the transitional activations inside the multi-head attention (MHA) and multi-layer perceptron (MLP) modules, while preserving the inter-module activations that are sensitive to perturbations. The LLM is thus compressed into an intra-module low-rank architecture, significantly reducing both weights and the KV cache. TransAct is implemented on the Llama2 model and evaluated on downstream benchmarks. Results verify the optimality of our approach at high compression ratios with respect to both speed and performance. Furthermore, ablation studies reveal the strength of iterative pruning and provide insights into the redundancy of the MHA and MLP modules.
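To make the intra-module low-rank idea concrete, the sketch below shows a Transformer block whose internal widths (the per-head dimension of MHA and the intermediate width of the MLP) are shrunk while the residual hidden size, i.e., the inter-module activation width, is kept intact, which also shrinks the KV cache. This is a minimal illustration under assumed dimensions (hidden 4096, 32 heads of size 64, MLP width 5504) and PyTorch module names; it is not the authors' pruning algorithm, and normalization layers are omitted for brevity.

```python
# Illustrative sketch only: reduced *intra-module* widths, unchanged residual width.
import torch
import torch.nn as nn
import torch.nn.functional as F


class IntraModuleLowRankBlock(nn.Module):
    """Transformer block whose transitional activations live in a reduced space.

    The residual stream keeps `hidden_size`, so activations passed between
    modules are untouched; only the widths inside MHA and the MLP are pruned,
    which also reduces the stored K/V tensors (the KV cache).
    """

    def __init__(self, hidden_size=4096, n_heads=32, head_dim=64, ffn_dim=5504):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        inner = n_heads * head_dim  # reduced MHA width (< hidden_size after pruning)
        self.q_proj = nn.Linear(hidden_size, inner, bias=False)
        self.k_proj = nn.Linear(hidden_size, inner, bias=False)
        self.v_proj = nn.Linear(hidden_size, inner, bias=False)
        self.o_proj = nn.Linear(inner, hidden_size, bias=False)
        self.gate_proj = nn.Linear(hidden_size, ffn_dim, bias=False)  # reduced MLP width
        self.up_proj = nn.Linear(hidden_size, ffn_dim, bias=False)
        self.down_proj = nn.Linear(ffn_dim, hidden_size, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        # Attention in the reduced intra-module space; K/V here are what a KV cache stores.
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.o_proj(attn.transpose(1, 2).reshape(b, t, -1))
        # Gated MLP with a reduced intermediate width; residual width is preserved.
        x = x + self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
        return x
```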
Paper Type: long
Research Area: Efficient/Low-Resource Methods for NLP
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings (efficiency)
Languages Studied: English