Keywords: large language model, compression
TL;DR: Existing differentiable weight clustering, notably DKM (ICLR 2022), requires too much memory. Here we show how to reduce this demand substantially.
Abstract: Since large language models (LLMs) have demonstrated high-quality performance on many complex language tasks, there is great interest in bringing them to mobile devices for faster responses and better privacy protection. However, the size of LLMs (i.e., billions of parameters) requires highly effective compression to fit into storage-limited devices. Among many compression techniques, weight clustering, a form of non-linear quantization, is one of the leading candidates for LLM compression and is supported by modern smartphones. Yet, its training overhead is prohibitively high for LLM fine-tuning. In particular, Differentiable KMeans Clustering (DKM) has shown a state-of-the-art trade-off between compression ratio and accuracy regression, but its large memory complexity makes it nearly impossible to apply to train-time LLM compression. In this paper, we propose a memory-efficient DKM implementation, eDKM, powered by novel techniques that reduce the memory footprint of DKM by orders of magnitude. For a given tensor to be saved on the CPU for the backward pass of DKM, we compress the tensor by applying uniquification and sharding, after first checking whether a duplicate of the tensor has already been copied to the CPU. Our experimental results demonstrate that eDKM can fine-tune and compress a pretrained LLaMA 7B model from 12.6 GB to 2.5 GB (3 bits/weight) with the Alpaca dataset, reducing the train-time memory footprint of a decoder layer by 130× while delivering good accuracy on broader LLM benchmarks (e.g., 77.7% on PIQA, 66.1% on WinoGrande, and so on).
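
To make the mechanism concrete, below is a minimal PyTorch sketch of the offloading idea the abstract describes: before a tensor needed for the backward pass is copied to the CPU, check whether an identical tensor was already offloaded; if not, store it in uniquified form (unique values plus integer indices) and shard the index tensor. This is an illustrative sketch, not the authors' implementation; the function names (`offload_to_cpu`, `restore_to_gpu`) and the use of `data_ptr()` as a duplicate key are assumptions.

```python
import torch

# Assumption: duplicates are detected by the source tensor's storage pointer.
_offloaded = {}  # data_ptr -> compressed CPU copy

def offload_to_cpu(t: torch.Tensor, num_shards: int = 4):
    key = t.data_ptr()
    if key in _offloaded:
        # A duplicate was already copied to the CPU; reuse it instead of re-copying.
        return _offloaded[key]
    # Uniquification: keep only the (few) distinct values plus integer indices.
    values, inverse = torch.unique(t.reshape(-1), return_inverse=True)
    inverse = inverse.to(torch.int32)
    # Sharding: split the index tensor into smaller CPU-resident chunks.
    shards = list(torch.chunk(inverse.cpu(), num_shards))
    entry = (values.cpu(), shards, t.shape)
    _offloaded[key] = entry
    return entry

def restore_to_gpu(entry, device="cuda"):
    # Reassemble the original tensor for the backward pass.
    values, shards, shape = entry
    inverse = torch.cat(shards).to(device).long()
    return values.to(device)[inverse].reshape(shape)
```

The savings come from the fact that the tensors DKM saves for backward contain few distinct values, so the `values` array is tiny and the `int32` index shards cost far less than the original floating-point copy.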
Submission Number: 1