Keywords: large language model, compression
TL;DR: Existing differentiable weight clustering, notably DKM (ICLR 2022), requires too much memory. Here we show how to reduce this demand substantially.
Abstract: Since large language models (LLMs) have demonstrated high-quality performance on many complex language tasks, there is great interest in bringing them to mobile devices for faster responses and better privacy protection. However, the size of LLMs (i.e., billions of parameters) requires highly effective compression to fit into storage-limited devices. Among many compression techniques, weight clustering, a form of non-linear quantization, is one of the leading candidates for LLM compression and is supported by modern smartphones. Yet, its training overhead is prohibitively high for LLM fine-tuning. In particular, Differentiable KMeans Clustering (DKM) has shown a state-of-the-art trade-off between compression ratio and accuracy regression, but its large memory complexity makes it nearly impossible to apply to train-time LLM compression. In this paper, we propose a memory-efficient DKM implementation, eDKM, powered by novel techniques that reduce the memory footprint of DKM by orders of magnitude. For a given tensor to be saved on the CPU for the backward pass of DKM, we compress the tensor by applying uniquification and sharding, after first checking whether a duplicate of the tensor has already been copied to the CPU. Our experimental results demonstrate that eDKM can fine-tune and compress a pretrained LLaMA 7B model from 12.6 GB to 2.5 GB (3 bits/weight) with the Alpaca dataset, reducing the train-time memory footprint of a decoder layer by 130× while delivering good accuracy on broader LLM benchmarks (e.g., 77.7% on PIQA, 66.1% on WinoGrande, and so on).
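
To make the mechanism concrete, below is a minimal PyTorch sketch of the offloading idea the abstract describes: before a tensor needed for the backward pass is copied to the CPU, check whether an identical tensor was already offloaded; if not, store it in uniquified form (unique values plus integer indices) and shard the index tensor. This is an illustrative sketch, not the authors' implementation; the function names (`offload_to_cpu`, `restore_to_gpu`) and the use of `data_ptr()` as a duplicate key are assumptions.

```python
import torch

# Assumption: duplicates are detected by the source tensor's storage pointer.
_offloaded = {}  # data_ptr -> compressed CPU copy

def offload_to_cpu(t: torch.Tensor, num_shards: int = 4):
    key = t.data_ptr()
    if key in _offloaded:
        # A duplicate was already copied to the CPU; reuse it instead of re-copying.
        return _offloaded[key]
    # Uniquification: keep only the (few) distinct values plus integer indices.
    values, inverse = torch.unique(t.reshape(-1), return_inverse=True)
    inverse = inverse.to(torch.int32)
    # Sharding: split the index tensor into smaller CPU-resident chunks.
    shards = list(torch.chunk(inverse.cpu(), num_shards))
    entry = (values.cpu(), shards, t.shape)
    _offloaded[key] = entry
    return entry

def restore_to_gpu(entry, device="cuda"):
    # Reassemble the original tensor for the backward pass.
    values, shards, shape = entry
    inverse = torch.cat(shards).to(device).long()
    return values.to(device)[inverse].reshape(shape)
```

The savings come from the fact that the tensors DKM saves for backward contain few distinct values, so the `values` array is tiny and the `int32` index shards cost far less than the original floating-point copy.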
Submission Number: 1