Efficient LLM Pruning with Token-Dependency Awareness and Hardware-Adapted Inference

ACL ARR 2024 June Submission4742 Authors

16 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Structured pruning removes entire components, such as attention heads or hidden dimensions, to yield faster dense large language models. However, previous methods are time-consuming, and inference speedup is bottlenecked by inefficient GPU parallelism because the dimensions of pruned weight blocks are misaligned with tensor cores. Moreover, pruning heads in grouped-query attention (GQA) is rarely attempted because of the interdependencies among query, key, and value heads. To address these limitations, we propose (1) a structured pruning method for LLMs with GQA that learns which key, value, and shared query heads to retain according to their importance for accurate prediction; (2) a post-pruning weight update to better preserve the performance of pruned LLMs; and (3) a post-pruning dimension adaptation step that improves GPU utilization of pruned models and significantly speeds up inference. Our method speeds up inference by up to 60% over previous approaches. Evaluated on several language benchmarks using variants of LLaMA models and Mistral, it reduces pruning time by up to 90% while achieving higher inference speed and better performance across a range of sparsity ratios. Additionally, our findings suggest that pruning can alleviate prediction confusion in certain scenarios.
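
As an illustration of the dimension-adaptation idea mentioned in the abstract (the abstract does not specify the authors' actual procedure), the minimal sketch below zero-pads a pruned weight dimension up to a hardware-friendly multiple so that GEMM shapes align with tensor-core tile sizes; the multiple of 64, the zero-padding strategy, and the helper name pad_to_multiple are assumptions for illustration only.

```python
import torch

def pad_to_multiple(weight: torch.Tensor, multiple: int = 64) -> torch.Tensor:
    """Zero-pad the output dimension of a pruned weight matrix so it becomes a
    multiple of an assumed tensor-core-friendly tile size (64 here)."""
    out_dim, in_dim = weight.shape
    padded_out = ((out_dim + multiple - 1) // multiple) * multiple
    if padded_out == out_dim:
        return weight  # already aligned
    pad_rows = torch.zeros(padded_out - out_dim, in_dim,
                           dtype=weight.dtype, device=weight.device)
    return torch.cat([weight, pad_rows], dim=0)

# Example: a dimension pruned from 4096 to 2890 is padded to 2944 (= 46 * 64),
# keeping GEMM shapes aligned with GPU tensor-core tiles.
pruned = torch.randn(2890, 4096)
aligned = pad_to_multiple(pruned)
print(aligned.shape)  # torch.Size([2944, 4096])
```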
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: pruning, distillation, data-efficient training, NLP in resource-constrained settings, probing
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 4742