S$^{2}$FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity

Published: 25 Sept 2024 · Last Modified: 17 Jan 2025 · NeurIPS 2024 poster · CC BY 4.0
Keywords: PEFT; LLM Efficiency; LLM Post-training
Abstract: Current PEFT methods for LLMs can achieve high quality, efficient training, or scalable serving, but not all three simultaneously. To address this limitation, we investigate sparse fine-tuning and observe a remarkable improvement in generalization ability. Building on this key insight, we propose a family of Structured Sparse Fine-Tuning (S$^{2}$FT) methods for LLMs, which concurrently achieve state-of-the-art fine-tuning performance, training efficiency, and inference scalability. S$^{2}$FT accomplishes this by "selecting sparsely and computing densely". Based on the coupled structures in LLMs, S$^{2}$FT selects a few attention heads and channels in the MHA and FFN modules, respectively, for each Transformer block. Next, it co-permutes the weight matrices on both sides of all coupled structures to connect the selected subsets in each layer into a dense submatrix. Finally, S$^{2}$FT performs in-place gradient updates on all selected submatrices. Through theoretical analyses and empirical results, we show that our method prevents forgetting while simplifying optimization, delivers SOTA performance on both commonsense and arithmetic reasoning with 4.6% and 1.3% average improvements over LoRA, and surpasses full FT by 11.5% when generalizing to various domains after instruction tuning. Using our partial back-propagation algorithm, S$^{2}$FT reduces training memory by up to 3$\times$ and improves latency by 1.5-2.7$\times$ compared to full FT, while achieving an average 10% improvement over LoRA on both metrics. We further demonstrate that the weight updates in S$^{2}$FT can be decoupled into adapters, enabling effective fusion, fast switching, and efficient parallelism when serving multiple fine-tuned models.
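To make the "select sparsely, compute densely" idea concrete, below is a minimal PyTorch sketch on a single FFN up/down projection pair: a few intermediate channels are selected, the coupled weight matrices are co-permuted so those channels form one contiguous dense block, and gradient updates are restricted to that block. The names (`ffn_up`, `ffn_down`, `num_selected`) and the gradient-masking hooks are illustrative assumptions, not the authors' code; the paper's partial back-propagation algorithm avoids computing the masked gradients in the first place rather than zeroing them afterwards.

```python
# Sketch of structured sparse fine-tuning on one FFN pair (toy sizes).
import torch
import torch.nn as nn

hidden, intermediate, num_selected = 768, 3072, 128

ffn_up = nn.Linear(hidden, intermediate, bias=False)    # weight: intermediate x hidden
ffn_down = nn.Linear(intermediate, hidden, bias=False)  # weight: hidden x intermediate

# 1) Select a sparse subset of intermediate channels to fine-tune.
selected = torch.randperm(intermediate)[:num_selected]
keep = torch.ones(intermediate, dtype=torch.bool)
keep[selected] = False
perm = torch.cat([selected, torch.arange(intermediate)[keep]])  # selected channels first

# 2) Co-permute the coupled matrices: permuting W_up's rows and W_down's columns
#    with the same permutation leaves the FFN's function unchanged, but the
#    selected channels now form a contiguous dense block.
with torch.no_grad():
    ffn_up.weight.copy_(ffn_up.weight[perm, :])
    ffn_down.weight.copy_(ffn_down.weight[:, perm])

# 3) Update only the dense submatrices for the selected block; here this is
#    emulated by zeroing gradients outside the block before the optimizer step.
def keep_block_rows(grad):   # W_up: rows correspond to intermediate channels
    mask = torch.zeros_like(grad)
    mask[:num_selected, :] = 1.0
    return grad * mask

def keep_block_cols(grad):   # W_down: columns correspond to intermediate channels
    mask = torch.zeros_like(grad)
    mask[:, :num_selected] = 1.0
    return grad * mask

ffn_up.weight.register_hook(keep_block_rows)
ffn_down.weight.register_hook(keep_block_cols)

# Toy forward/backward: gradients land only in the selected dense submatrices.
x = torch.randn(4, hidden)
ffn_down(torch.relu(ffn_up(x))).sum().backward()
```

Because the trainable parameters are a contiguous submatrix of the original weights, the resulting update can also be stored and served as a small adapter, which is the property the abstract relies on for fusion, switching, and parallel serving.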
Primary Area: Generative models
Submission Number: 13213