Keywords: Large Language Models; Network Pruning
TL;DR: We present Sparsity-Aware Prompt Tuning, an efficient fine-tuning method to enhance the performance of sparse large language models.
Abstract: Pruning has recently demonstrated promising results in alleviating the heavy parameter burden and computational cost of Large Language Models (LLMs). However, the absence of sparsity-friendly fine-tuning significantly limits the performance of high-sparsity LLMs. While LoRA is the most popular fine-tuning approach for dense LLMs, it is inherently incompatible with unstructured sparsity: the merging operation densifies the weight matrix, thereby eliminating the benefits of sparsity. In this paper, we introduce Sparsity-aware Prompt Tuning (SPT), a simple and effective fine-tuning approach specifically tailored for sparse LLMs. Instead of fine-tuning the remaining weights or adding extra adapters, SPT learns soft prompts that compensate for pruned LLMs, enabling them to generate more desirable content. Pruning is imposed gradually during fine-tuning, with the prompt length proportional to the sparsity ratio assigned to each layer. This gradual imposition of pruning allows the output deviation caused by pruning to be efficiently mitigated through sparsity-aware prompt tuning. Our experimental results demonstrate that SPT significantly enhances the performance of sparse LLMs across a wide array of model architectures, parameter sizes, and tasks, particularly at high sparsity ratios. For instance, by fine-tuning an 80% sparse LLaMA-V2-13B produced by SparseGPT for just 2.5 hours, SPT improves zero-shot performance from 47.39% to 55.27%, outperforming its LoRA baseline by 2.55% while using only 6.5% of the latter's trainable parameters. The resulting sparse model delivers a 3.14x end-to-end inference speed-up with the DeepSparse inference engine.
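The submission does not include code on this page; the following PyTorch sketch is only an illustration, under stated assumptions, of the core idea the abstract describes: learning soft-prompt embeddings for a frozen, pruned model, with the prompt length scaling with the sparsity ratio. All names here (`SoftPrompt`, `prompt_length_for_sparsity`, the scaling rule, the input-level placement rather than the per-layer prompts the abstract mentions) are hypothetical simplifications, not the authors' implementation.

```python
import torch
import torch.nn as nn


def prompt_length_for_sparsity(sparsity: float, tokens_per_10pct: int = 2) -> int:
    """Hypothetical rule: prompt length grows in proportion to the sparsity ratio."""
    return max(1, int(round(sparsity * 10)) * tokens_per_10pct)


class SoftPrompt(nn.Module):
    """Learnable soft-prompt embeddings prepended to the input embeddings of a
    frozen, pruned model. Only these embeddings receive gradients, so the sparse
    weight matrices (and their inference speed-up) remain untouched."""

    def __init__(self, prompt_len: int, hidden_dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, hidden_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, hidden_dim)
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)


# Example: 80% sparsity -> a longer prompt than at lower sparsity.
sparsity = 0.8
soft_prompt = SoftPrompt(prompt_length_for_sparsity(sparsity), hidden_dim=5120)
dummy_embeds = torch.zeros(1, 16, 5120)
print(soft_prompt(dummy_embeds).shape)  # torch.Size([1, 32, 5120]): 16 prompt tokens + 16 input tokens
```

In a full training loop, the pruned model's parameters would be frozen and only `soft_prompt.prompt` passed to the optimizer, which is consistent with the abstract's claim of a very small trainable-parameter budget relative to LoRA.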
Submission Number: 69