TL;DR: We introduce SLiM, a one-shot quantized-sparse-plus-low-rank method for compressing LLMs without retraining, reducing memory and inference costs while improving accuracy and enabling efficient deployment in memory-constrained environments.
Abstract: Conventional model compression techniques for LLMs address high memory consumption and slow inference, but typically require computationally expensive retraining to preserve accuracy. In contrast, one-shot compression methods eliminate the retraining cost but struggle to match the accuracy of dense models. This paper presents SLiM, a new one-shot compression framework that holistically integrates hardware-friendly quantization, sparsity, and low-rank approximation into a unified process. First, we formulate the quantization process using a probabilistic approach (SLiM-Quant) that enables us to apply uniform quantization. Then, we use an existing one-shot pruning method to apply semi-structured sparsity on top of the quantized weights. Finally, to compensate for the aggregated quantization and sparsity error, we use a novel saliency function with unique invertible and additive properties that enables us to mathematically compute the values of the low-rank adapters. SLiM improves model accuracy by up to 5.66% (LLaMA-2-7B) for 2:4 sparsity with 4-bit weight quantization, outperforming prior methods. Models compressed with SLiM achieve speedups of up to 4.3× and 3.8× on NVIDIA RTX 3060 and A100 GPUs, respectively, and reduce end-to-end memory consumption to as little as 0.23× that of their dense counterparts. We also propose an optional PEFT recipe that further improves accuracy by up to 1.66% (LLaMA-2-13B) compared to SLiM without fine-tuning.
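To make the pipeline concrete, the following is a minimal PyTorch sketch of compressing a single weight matrix in one shot: a uniform quantizer whose scale is chosen by a simple grid search (an illustrative stand-in for SLiM-Quant's probabilistic formulation), 2:4 magnitude pruning applied on top of the quantized weights, and a rank-r SVD of the residual error as a substitute for the paper's saliency-based closed-form adapter. The function names and the SVD-based correction are assumptions for illustration, not the paper's exact method.

```python
# Illustrative sketch of the one-shot quantize -> prune -> low-rank-correct pipeline
# on a single weight matrix. The scale search, pruning rule, and SVD error correction
# are stand-ins for SLiM-Quant, the one-shot pruner, and the saliency-based adapter;
# they are NOT the paper's exact formulas.
import torch


def quantize_uniform(w: torch.Tensor, bits: int = 4, grid: int = 100) -> torch.Tensor:
    """Symmetric uniform quantization with a clipping scale found by grid search."""
    qmax = 2 ** (bits - 1) - 1
    best_scale, best_err = None, float("inf")
    for frac in torch.linspace(0.5, 1.0, grid):
        scale = frac * w.abs().max() / qmax
        q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
        err = (q - w).pow(2).sum().item()
        if err < best_err:
            best_scale, best_err = scale, err
    return torch.clamp(torch.round(w / best_scale), -qmax - 1, qmax) * best_scale


def prune_2_4(w: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude weights in every group of 4 (semi-structured 2:4)."""
    rows, cols = w.shape
    groups = w.reshape(rows, cols // 4, 4)
    idx = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)
    return (groups * mask).reshape(rows, cols)


def low_rank_correction(w_dense, w_compressed, rank: int = 32):
    """Rank-r SVD of the compression error; SLiM instead derives the adapter from a saliency function."""
    u, s, vh = torch.linalg.svd(w_dense - w_compressed, full_matrices=False)
    return u[:, :rank] * s[:rank], vh[:rank, :]


w = torch.randn(256, 512)
w_c = prune_2_4(quantize_uniform(w))          # quantize, then impose 2:4 sparsity
L, R = low_rank_correction(w, w_c, rank=32)   # compensate the aggregated error
print("relative error without adapter:", ((w_c - w).norm() / w.norm()).item())
print("relative error with adapter:   ", ((w_c + L @ R - w).norm() / w.norm()).item())
```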
Lay Summary: Large language models power many AI applications but often demand vast memory and compute resources, making them hard to run on everyday devices or at scale. To address this, we introduce SLiM, a "one-shot" compression method that reduces a model's size without any expensive retraining by weaving together three contributions: uniform quantization, structured sparsity (pruning), and a low-rank adapter that mathematically corrects the compound errors from the first two steps. SLiM's quantization step uses a probabilistic search to pick the best scaling factor; its pruning step applies a hardware-friendly sparsity pattern; and its low-rank adapter step uses a saliency measure to compute corrections in closed form. The result is a model up to 8× smaller that narrows the gap to the original model's accuracy (up to 5.66% higher accuracy than the prior state of the art), runs up to 4.3× faster, and uses as little as 0.23× of the memory on off-the-shelf GPUs. An optional, lightweight fine-tuning recipe can boost accuracy by another 1.66% with minimal overhead. By packaging SLiM as an easy-to-use tool with publicly available code, we aim to make advanced language models more accessible, energy-efficient, and ready for deployment in real-world environments.
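For intuition about the size figure quoted above, here is a back-of-envelope calculation of the weight-storage compression from combining 4-bit quantization with 2:4 sparsity. The 2-bit-per-kept-value metadata overhead is an assumption based on the common NVIDIA 2:4 sparse format, and adapter and activation storage are ignored, so the realized end-to-end ratio differs.

```python
# Back-of-envelope arithmetic behind the "8x smaller" weight-storage figure:
# FP16 dense weights use 16 bits each, while 4-bit quantization plus 2:4 sparsity
# stores only half of the weights at 4 bits. Sparsity index metadata (assumed here
# to be 2 bits per kept value, as in the NVIDIA 2:4 format) lowers the ideal ratio.
dense_bits_per_weight = 16.0
quant_bits = 4.0
kept_fraction = 0.5           # 2 of every 4 weights survive 2:4 pruning
metadata_bits_per_kept = 2.0  # assumed index overhead of the 2:4 format

ideal = dense_bits_per_weight / (quant_bits * kept_fraction)
with_metadata = dense_bits_per_weight / (kept_fraction * (quant_bits + metadata_bits_per_kept))
print(f"ideal compression:  {ideal:.1f}x")          # 8.0x
print(f"with 2:4 metadata:  {with_metadata:.1f}x")  # ~5.3x
```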
Link To Code: https://github.com/Mohammad-Mozaffari/slim
Primary Area: Deep Learning
Keywords: sparsity, 2:4 sparsity, quantization, low-rank, lora
Submission Number: 12301