Efficient LLM Pruning with Global Token-Dependency Awareness and Hardware-Adapted Inference

Published: 21 Jun 2024 · Last Modified: 26 Jul 2024 · ES-FoMo-II 2024 Poster · CC BY 4.0
Keywords: TVA-prune, pruning, compression, LLM, LLaMA, Mistral, 7B, VIB, Fast, Inference, Compression Time, structured
TL;DR: A time-efficient structured pruning method that prunes token representations globally across all layers and adapts to GPU hardware for extremely fast inference.
Abstract: Structured pruning removes entire components such as attention heads or layers to yield faster dense models. However, previous methods require significant pruning time and overlook the token embedding dimension, missing potential inference acceleration. Moreover, pruning heads in grouped-query attention is rarely attempted because of the interdependencies among the grouped heads. To address these limitations, we propose a structured pruning method for LLMs that incorporates the concept of the Variational Information Bottleneck (VIB) to obtain compressed representations at each structural element while preserving the information essential for accurate prediction. We enhance the formulation to account for the influence of all preceding layers on the current compressed representation, enabling a globally informed reduction of the token embedding dimension and of grouped-query heads, neither of which was explored in previous work. Additionally, we include a post-pruning step that adjusts the pruned model dimensions to make optimal use of Tensor Cores in GPUs, which speeds up inference by up to 60%. Evaluated on several language benchmarks with LLaMA variants and Mistral, our method reduces pruning time by up to 90% while achieving higher inference speed and competitive performance.
Submission Number: 77
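The post-pruning dimension adjustment described in the abstract can be pictured with a minimal sketch: Tensor Cores reach peak throughput when GEMM dimensions are multiples of a small hardware-friendly value (e.g., 8 for FP16/BF16), so a pruned width is rounded to such a multiple before the compact model is materialized. The function name and the multiple-of-8 choice below are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch of rounding a pruned width to a Tensor-Core-friendly
# multiple; the concrete multiple and API are assumptions for illustration.

def round_to_hardware_multiple(kept_dim: int, multiple: int = 8) -> int:
    """Round a pruned dimension to the nearest multiple of `multiple`,
    never dropping below one full multiple."""
    rounded = round(kept_dim / multiple) * multiple
    return max(rounded, multiple)

if __name__ == "__main__":
    # e.g. pruning keeps 1387 of 4096 embedding channels; align it for GEMMs
    print(round_to_hardware_multiple(1387))  # -> 1384
    print(round_to_hardware_multiple(1390))  # -> 1392
```

In practice the rounding direction (up vs. nearest) trades a few extra parameters against guaranteed alignment; the sketch uses nearest-multiple rounding purely as an example.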