Pruning Large Language Models to Intra-module Low-rank Structure with Transitional Activations

Anonymous

16 Feb 2024
ACL ARR 2024 February Blind Submission
Readers: Everyone
Abstract: Structured pruning offers a viable approach to the local deployment of large language models (LLMs) by reducing computational and memory overheads. Compared to unstructured pruning and quantization, structured pruning has the advantage of being recoverable, since the pruned model remains dense and high-precision rather than sparse or low-precision. However, achieving a high compression ratio for scaled-up LLMs remains a challenge, as coarse-grained structured pruning inflicts substantial damage on the highly interconnected model. In this paper, we introduce TransAct, a task-agnostic structured pruning approach coupled with a compact architecture design. TransAct reduces the transitional activations inside the multi-head attention (MHA) and multi-layer perceptron (MLP) modules, while preserving the inter-module activations that are sensitive to perturbations. The LLM is thus compressed into an intra-module low-rank architecture, significantly reducing both weights and the KV cache. TransAct is implemented on the Llama2 model and evaluated on downstream benchmarks. Results verify the optimality of our approach at high compression ratios with respect to both speed and performance. Furthermore, ablation studies reveal the strength of iterative pruning and provide insights into the redundancy of the MHA and MLP modules.
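To make the intra-module low-rank idea concrete, the sketch below shows a Transformer block whose internal widths (the per-head dimension of MHA and the intermediate width of the MLP) are shrunk while the residual hidden size, i.e., the inter-module activation width, is kept intact, which also shrinks the KV cache. This is a minimal illustration under assumed dimensions (hidden 4096, 32 heads of size 64, MLP width 5504) and PyTorch module names; it is not the authors' pruning algorithm, and normalization layers are omitted for brevity.

```python
# Illustrative sketch only: reduced *intra-module* widths, unchanged residual width.
import torch
import torch.nn as nn
import torch.nn.functional as F


class IntraModuleLowRankBlock(nn.Module):
    """Transformer block whose transitional activations live in a reduced space.

    The residual stream keeps `hidden_size`, so activations passed between
    modules are untouched; only the widths inside MHA and the MLP are pruned,
    which also reduces the stored K/V tensors (the KV cache).
    """

    def __init__(self, hidden_size=4096, n_heads=32, head_dim=64, ffn_dim=5504):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        inner = n_heads * head_dim  # reduced MHA width (< hidden_size after pruning)
        self.q_proj = nn.Linear(hidden_size, inner, bias=False)
        self.k_proj = nn.Linear(hidden_size, inner, bias=False)
        self.v_proj = nn.Linear(hidden_size, inner, bias=False)
        self.o_proj = nn.Linear(inner, hidden_size, bias=False)
        self.gate_proj = nn.Linear(hidden_size, ffn_dim, bias=False)  # reduced MLP width
        self.up_proj = nn.Linear(hidden_size, ffn_dim, bias=False)
        self.down_proj = nn.Linear(ffn_dim, hidden_size, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        # Attention in the reduced intra-module space; K/V here are what a KV cache stores.
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.o_proj(attn.transpose(1, 2).reshape(b, t, -1))
        # Gated MLP with a reduced intermediate width; residual width is preserved.
        x = x + self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
        return x
```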
Paper Type: long
Research Area: Efficient/Low-Resource Methods for NLP
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings (efficiency)
Languages Studied: English