Keywords: model compression, large language models, finetuned models.
TL;DR: We take a first step towards a paradigm shift in efficient compression of finetuned models by discovering that the difference between a finetuned model and its pretrained base model is much more compressible than the models themselves.
Abstract: Storage is increasingly a practical bottleneck to scaling large language model (LLM) systems for personalization, co-location, and other use cases that require storing the pretrained base model plus multiple finetuned models. To address this, we propose GPT-Zip for post-finetuning compression. GPT-Zip uses quantization and sparsification to compress finetuned models efficiently by exploiting their closeness to the pretrained base model. Specifically, we demonstrate that the \emph{difference} between a finetuned model and the pretrained base model can be quantized to $2$ bits and simultaneously pruned to $95\%$ sparsity, providing up to a $52\times$ overall size reduction. Thus, GPT-Zip avoids the linear growth in memory costs incurred by naive storage. We show that this compression can be achieved without performance degradation, as measured by evaluations on several tasks from the Natural Instructions dataset. Surprisingly, GPT-Zip sometimes improves accuracy over the uncompressed models. We demonstrate the efficacy of GPT-Zip on four finetuned OPT-1.3B models and show that GPT-Zip reduces the storage cost by $16$ times more than existing LLM compression techniques while attaining significantly better performance.
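To make the idea of compressing the difference concrete, the following is a minimal sketch and not the GPT-Zip implementation. The abstract does not state which pruning criterion or quantizer is used, so magnitude pruning and uniform min-max quantization are assumptions made here for illustration, and the function names `compress_delta` and `decompress_delta` are hypothetical.

```python
import numpy as np


def compress_delta(base, finetuned, sparsity=0.95, bits=2):
    """Sparsify and quantize the difference (finetuned - base).

    Assumptions (not from the paper): magnitude pruning and
    uniform min-max quantization of the surviving entries.
    """
    delta = finetuned - base
    flat = np.abs(delta).ravel()

    # Keep only the largest-magnitude entries; all others are dropped.
    k = max(int(round(flat.size * (1.0 - sparsity))), 1)
    threshold = np.partition(flat, flat.size - k)[flat.size - k]
    mask = np.abs(delta) >= threshold
    kept = delta[mask]

    # Map surviving entries onto 2**bits evenly spaced levels.
    levels = 2 ** bits
    lo, hi = float(kept.min()), float(kept.max())
    scale = (hi - lo) / (levels - 1) if hi > lo else 1.0
    codes = np.round((kept - lo) / scale).astype(np.uint8)
    return mask, codes, lo, scale


def decompress_delta(base, mask, codes, lo, scale):
    """Rebuild an approximate finetuned tensor from the base and the codes."""
    delta = np.zeros_like(base)
    delta[mask] = codes.astype(base.dtype) * scale + lo
    return base + delta
```

Only the mask, the low-bit codes, and two scalars need to be stored per finetuned tensor, while the base weights are stored once and shared. The sketch ignores the cost of encoding the mask positions, so it should not be used to reproduce the size reductions reported above.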
Submission Number: 25