Abstract: Large language models have demonstrated remarkable success across a wide range of domains, and supervised fine-tuning is widely adopted to adapt them to real-world scenarios. Given the diversity of downstream tasks and their varying demands, efficiently deploying multiple full-parameter fine-tuned models presents a significant challenge. To address this, we analyze $\textit{Balanced Intermediate Dropout}$, a distribution-related phenomenon whereby the matrix-computed intermediate results for the delta weight of each fine-tuned model exhibit extremely small variance and min-max range. Leveraging this phenomenon, we propose DeltaDQ, a novel distribution-driven delta compression framework that employs $\textit{Group-wise Balanced Dropout}$ and $\textit{Delta Quantization}$ to compress the delta weight efficiently. $\textit{Group-wise Balanced Dropout}$ achieves a favorable trade-off between accuracy and performance while ensuring an N:M sparsity pattern. $\textit{Delta Quantization}$ further compresses the delta weight based on its distribution characteristics. Experimental results show that, at a 96.875% compression rate, our framework improves accuracy by 4.47 and 4.70 points over the baseline on WizardMath-7B and 13B, and even improves accuracy by 1.83 and 0.61 points over the original models on WizardCoder-13B and 34B.
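The pipeline the abstract describes (take the delta weight between a fine-tuned model and its base, enforce an N:M sparsity pattern on it, then quantize the narrow-ranged surviving values) can be sketched as below. This is a minimal illustration under assumptions of my own, not the paper's implementation: a generic magnitude-based 2:4 pruning step and a simple symmetric uniform quantizer stand in for Group-wise Balanced Dropout and Delta Quantization, and all function names, shapes, and bit widths are hypothetical.

```python
# Illustrative sketch only: generic N:M magnitude pruning and a symmetric
# uniform quantizer stand in for the paper's Group-wise Balanced Dropout and
# Delta Quantization. All names and parameters here are hypothetical.
import numpy as np

def nm_sparsify(delta: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Keep the n largest-magnitude entries in every group of m consecutive weights.

    Assumes delta.size is divisible by m.
    """
    flat = delta.reshape(-1, m)                      # group weights into blocks of m
    keep = np.argsort(-np.abs(flat), axis=1)[:, :n]  # indices of the n largest per block
    mask = np.zeros_like(flat, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return (flat * mask).reshape(delta.shape)

def quantize(delta: np.ndarray, bits: int = 4):
    """Symmetric uniform quantization of the (narrow-ranged) delta weights."""
    max_abs = float(np.abs(delta).max())
    scale = max_abs / (2 ** (bits - 1) - 1) if max_abs > 0 else 1.0
    q = np.round(delta / scale).astype(np.int8)      # values fit in [-7, 7] for 4 bits
    return q, scale

# Delta compression: store only the sparse, low-bit difference from the base model.
w_base = np.random.randn(1024, 1024).astype(np.float32)
w_ft = w_base + 0.01 * np.random.randn(1024, 1024).astype(np.float32)

delta = w_ft - w_base                  # delta weight of the fine-tuned model
sparse_delta = nm_sparsify(delta)      # 2:4 pattern -> half of the entries dropped
q_delta, scale = quantize(sparse_delta, bits=4)

# At serving time, the fine-tuned weights are recovered from the shared base
# plus the compressed delta.
w_reconstructed = w_base + q_delta.astype(np.float32) * scale
```

The point of the sketch is only the structure of the compression: one base model is stored in full, and each downstream fine-tune costs only a sparse, low-bit delta.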
Paper Type: long
Research Area: Efficient/Low-Resource Methods for NLP
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English