ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization
Abstract: Parameter-efficient fine-tuning (PEFT) enables the creation of specialized language models for diverse tasks, resulting in numerous expert modules. In many practical use cases, these expert PEFT modules are integrated into a single model that answers arbitrary queries by routing queries to different experts. However, only a few experts can be kept in GPU memory due to memory constraints. Consequently, expert modules are frequently loaded and offloaded between CPU/GPU memory or disk storage. This frequent swapping dramatically increases communication overhead, leading to unacceptable latency and degrading user experience. The large size of modern PEFT modules further exacerbates this latency. For example, QLoRA experts for 65B LLaMA are 3.2GB, making swapping a major communication bottleneck, particularly in memory-constrained environments. To address these issues, we present ComPEFT (compressed PEFT), a novel method for compressing fine-tuning residuals (task vectors) of PEFT models. Reducing expert PEFT module size effectively addresses both memory and communication limitations, facilitating faster swapping and enabling a higher density of experts within a given memory footprint. ComPEFT employs sparsification and ternary quantization to reduce PEFT module size without any additional training while preserving or enhancing model performance. In extensive evaluations across T5-, T0-, and LLaMA-based models with 200M–65B parameters, ComPEFT achieves compression ratios of 8x–50x. Notably, we show that ComPEFT improves with scale: stronger models exhibit higher compressibility and better performance. We show that ComPEFT applied to LLaMA-65B outperforms QLoRA by 4.16% on MMLU with a 26x storage size reduction. Additionally, compressed experts produced by ComPEFT maintain few-shot compositional generalization capabilities, facilitate efficient communication and computation, and exhibit enhanced performance when merged. Lastly, we provide an analysis of different method components, compare ComPEFT with other PEFT methods, and test its efficacy for compressing full fine-tuning residuals.
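The abstract states that ComPEFT compresses fine-tuning residuals (task vectors) via sparsification and ternary quantization without additional training. Below is a minimal sketch of that general idea, assuming top-magnitude sparsification with a single per-tensor scale; the density value, scaling rule, and function names are illustrative assumptions, not the paper's prescribed implementation.

```python
import torch


def compress_task_vector(finetuned: torch.Tensor,
                         pretrained: torch.Tensor,
                         density: float = 0.05):
    """Sketch: sparsify then ternarize a fine-tuning residual (task vector).

    The residual is the difference between fine-tuned and pretrained weights.
    We keep only the largest-magnitude entries (sparsification) and store just
    their signs plus one scalar, so each entry becomes -1, 0, or +1 times a
    scale (ternary quantization). The density and scale choices here are
    illustrative, not the paper's reported settings.
    """
    task_vector = finetuned - pretrained

    # Sparsification: keep the top `density` fraction of entries by magnitude.
    k = max(1, int(density * task_vector.numel()))
    threshold = task_vector.abs().flatten().kthvalue(
        task_vector.numel() - k + 1).values
    mask = task_vector.abs() >= threshold

    # Ternary quantization: store only the signs of surviving entries,
    # rescaled by a single scalar (here, the mean magnitude of kept entries).
    scale = task_vector[mask].abs().mean()
    ternary = torch.sign(task_vector) * mask  # values in {-1, 0, +1}

    return ternary.to(torch.int8), scale


def decompress(pretrained: torch.Tensor,
               ternary: torch.Tensor,
               scale: torch.Tensor) -> torch.Tensor:
    """Reconstruct approximate fine-tuned weights from the compressed residual."""
    return pretrained + scale * ternary.to(pretrained.dtype)
```

The storage saving comes from keeping only a sparse set of 2-bit (sign) entries plus one scalar per tensor instead of dense 16-bit residuals, which is consistent with the 8x–50x compression ratios reported in the abstract, though the exact encoding used by ComPEFT may differ.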
Submission Length: Regular submission (no more than 12 pages of main content)
Supplementary Material: zip
Assigned Action Editor: ~Kangwook_Lee1
Submission Number: 4255