TL;DR: SKIM introduces an effective post-training quantization technique for large language models.
Abstract: Large Language Models (LLMs) exhibit impressive performance across various tasks, but deploying them for inference poses challenges. Their high resource demands often necessitate complex, costly multi-GPU pipelines, or the use of smaller, less capable models. Quantization offers a promising solution by storing model weights at lower precision, yet existing methods frequently suffer significant performance drops at lower bit widths. Moreover, they typically provide only a limited set of solutions at specific bit levels, many of which rely on extensive manual tuning. To address these challenges, we propose a new method called \textbf{SKIM}: Scaled K-means clustering wIth Mixed precision. Our approach introduces two novel techniques: 1. a \textit{greedy algorithm} that computes an approximately optimal bit allocation across weight channels, and 2. a \textit{trainable scaling vector} for non-differentiable K-means clustering. These techniques substantially improve model performance and can adapt to any given bit width. Notably, in terms of perplexity, our method narrows the gap between quantized LLaMA models and their full-precision counterparts by around \textbf{14\%} on average.
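To illustrate the first technique, below is a minimal sketch of greedy mixed-precision bit allocation across weight channels. The error proxy, bit range, and function names are illustrative assumptions rather than the paper's implementation: the paper measures error via K-means clustering, while this stand-in uses a cheap uniform grid to keep the greedy loop readable.

```python
import numpy as np

def quant_error(channel: np.ndarray, bits: int) -> float:
    """Cheap quantization-error proxy (uniform grid). The paper uses K-means
    clustering error; this stand-in only illustrates the greedy loop."""
    lo, hi = channel.min(), channel.max()
    if hi == lo:
        return 0.0
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)
    q = np.round((channel - lo) / step) * step + lo
    return float(np.sum((channel - q) ** 2))

def greedy_bit_allocation(W: np.ndarray, avg_bits: float,
                          min_bits: int = 2, max_bits: int = 8) -> np.ndarray:
    """Start every channel (row of W) at min_bits, then repeatedly grant one
    extra bit to the channel whose error would drop the most, until the total
    budget of round(avg_bits * n_channels) bits is spent."""
    n = W.shape[0]
    bits = np.full(n, min_bits)
    errors = np.array([quant_error(W[i], min_bits) for i in range(n)])
    budget = int(round(avg_bits * n)) - min_bits * n
    for _ in range(budget):
        # marginal error reduction from granting one extra bit to each channel
        gains = np.array([
            errors[i] - quant_error(W[i], bits[i] + 1)
            if bits[i] < max_bits else -np.inf
            for i in range(n)
        ])
        best = int(np.argmax(gains))
        bits[best] += 1
        errors[best] -= gains[best]
    return bits

# Example: target an average of 3.2 bits over 16 channels of 64 weights each.
alloc = greedy_bit_allocation(np.random.randn(16, 64), avg_bits=3.2)
print(alloc, alloc.mean())
```

Because each channel's extra bit is chosen by its marginal error reduction, fractional average bit widths such as 3.2 fall out naturally from the per-channel budget.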
Lay Summary: Large language models (LLMs) like ChatGPT are powerful but require enormous computing resources, making them costly and impractical for everyday devices. Current methods for shrinking these models either sacrifice accuracy at lower precision levels or work only at specific precisions, limiting flexibility.
To solve this, we developed SKIM, a new quantization technique for model compression. SKIM allocates different precision levels to different parts of the model using a greedy algorithm, much like managing a tight budget. In addition, a trainable scaling vector acts like a thermostat for weight values, smoothing the data and adjusting the compression computation to preserve accuracy. Together, these innovations allow SKIM to compress LLMs to any bit width (even fractional ones like 3.2 bits) while minimizing performance loss.
This advance narrows the accuracy gap with full-precision models by around 14% on average, enabling efficient deployment in resource-constrained scenarios. By making powerful AI more accessible, SKIM helps democratize language technology without high costs.
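As a rough illustration of the trainable scaling vector described above, the sketch below optimizes a per-channel scale around fixed K-means assignments, so that gradients bypass the non-differentiable clustering step. The tensor shapes, calibration objective, and optimizer settings are assumptions for illustration only, not taken from the paper.

```python
import torch

# Hypothetical setup: per-row codebooks and cluster assignments from a prior
# K-means step (both held fixed here), plus a trainable per-row scaling vector.
torch.manual_seed(0)
W = torch.randn(512, 512)                      # full-precision weights
X = torch.randn(64, 512)                       # calibration activations
K = 8                                          # 3-bit codebook per row
codebook = torch.randn(512, K)
assignments = torch.randint(0, K, (512, 512))  # fixed, non-differentiable indices
scale = torch.ones(512, 1, requires_grad=True)

opt = torch.optim.Adam([scale], lr=1e-2)
for _ in range(200):
    # Dequantize: scale * codebook lookup. Only `scale` receives gradients,
    # so the clustering step itself is never backpropagated through.
    W_hat = scale * torch.gather(codebook, 1, assignments)
    loss = torch.nn.functional.mse_loss(X @ W_hat.T, X @ W.T)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Training the scale against a small calibration set in this way lets the reconstruction adapt to the layer's actual outputs even though the cluster assignments stay discrete.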
Primary Area: General Machine Learning->Scalable Algorithms
Keywords: model quantization, large language model, efficient inference
Submission Number: 9211