TL;DR: We present any4, a 4-bit weight quantization solution that learns an arbitrary numeric representation, without requiring pre-processing of weights or activations.
Abstract: We present any4, a learned 4-bit weight quantization solution for large language models (LLMs) providing arbitrary numeric representations without requiring pre-processing of weights or activations. any4 yields higher accuracy compared to other related 4-bit numeric representation types: int4, fp4, and nf4, as evaluated on a range of model sizes, generations, and families (Llama 2, Llama 3, Mistral, and Mixtral). While any4 does not require pre-processing of weights or activations, it is also competitive with orthogonal techniques that do require such pre-processing (e.g., AWQ and GPTQ). We also experiment with any3 and any2 and show their competitiveness at lower bits. Additionally, we show that we can calibrate using a single curated, diverse sample rather than hundreds of samples from a dataset, as done in most quantization approaches. We also open source tinygemm, a latency-optimized GPU matrix multiplication library for LLMs, which implements any4 using a GPU-efficient lookup table strategy along with other common quantization methods. We open source our code at https://github.com/facebookresearch/any4.
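To make the learned-lookup-table idea concrete, below is a minimal sketch of per-row 4-bit codebook quantization, assuming plain k-means over the weights alone (a simplification; any4's actual values are learned with calibration). The function names `quantize_rowwise_any4` and `dequantize_rowwise_any4`, and the use of scikit-learn's KMeans, are illustrative and are not part of the released any4/tinygemm API.

```python
# Minimal sketch (not the paper's exact algorithm): learn a per-row 4-bit
# codebook with k-means over the weights, store 4-bit codes, and dequantize
# by looking each code up in that row's table of 16 learned values.
import numpy as np
from sklearn.cluster import KMeans

def quantize_rowwise_any4(W, n_values=16, seed=0):
    """Return 4-bit codes plus a per-row lookup table of learned values."""
    rows, cols = W.shape
    codes = np.empty((rows, cols), dtype=np.uint8)
    lut = np.empty((rows, n_values), dtype=W.dtype)
    for r in range(rows):
        km = KMeans(n_clusters=n_values, n_init=10, random_state=seed)
        codes[r] = km.fit_predict(W[r].reshape(-1, 1)).astype(np.uint8)
        lut[r] = km.cluster_centers_.ravel()   # the 16 representable values for this row
    return codes, lut

def dequantize_rowwise_any4(codes, lut):
    """Reconstruct weights: each 4-bit code indexes its own row's table."""
    return np.take_along_axis(lut, codes.astype(np.int64), axis=1)

# Toy usage: quantize a random weight matrix and measure reconstruction error.
W = np.random.randn(8, 256).astype(np.float32)
codes, lut = quantize_rowwise_any4(W)
W_hat = dequantize_rowwise_any4(codes, lut)
print("mean abs error:", float(np.abs(W - W_hat).mean()))
```

In a real deployment the 4-bit codes would be packed two per byte and the table lookup fused into the GPU matrix multiplication kernel, which is the role the abstract describes for tinygemm's lookup-table strategy.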
Lay Summary: Large language models (LLMs), like ChatGPT or Llama, are powerful but very large—they have billions of numbers (called weights) that take up a lot of memory and make them slow and expensive to run. To speed them up and allow them to run on smaller devices, we need to make these numbers smaller in size without hurting performance.
One common technique to do this is quantization, which means representing each number using fewer bits (the basic unit of data in computers). Normally, LLMs use 16 bits per weight. In this work, we focus on using only 4 bits per weight, which makes models much smaller and faster.
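As a rough illustration of what "fewer bits per weight" means, the sketch below rounds each weight to one of 16 integer levels shared through a single scale, which is essentially the standard int4 format the paper compares against; the helper names here are made up for this example.

```python
# Minimal sketch of plain round-to-nearest int4 quantization (one of the
# baseline formats), using a single scale shared by a row of weights.
import numpy as np

def quantize_int4(w_row):
    """Map floating-point weights to integers in [-8, 7] plus one scale."""
    scale = np.abs(w_row).max() / 7.0            # largest magnitude maps to +/-7
    q = np.clip(np.round(w_row / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    """Recover approximate weights from the 4-bit integers and the scale."""
    return q.astype(np.float32) * scale

w = np.array([0.42, -1.30, 0.07, 2.10], dtype=np.float32)
q, scale = quantize_int4(w)
print(q)                          # e.g. [ 1 -4  0  7]: each entry fits in 4 bits
print(dequantize_int4(q, scale))  # approximately the original weights
```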
The challenge with 4-bit quantization is keeping the model accurate. Our method, any4, learns the best way to represent these numbers, giving each row of weights its own custom mapping. This makes any4 more accurate than other 4-bit formats like int4, fp4, and nf4, and even competitive with more complex methods that need a lot of extra steps (like GPTQ and AWQ).
We also tried 3-bit and 2-bit versions (any3 and any2) and found them competitive. Uniquely, our method needs only a single well-chosen example to calibrate and tune the best mapping for each row of weights, while other quantization methods typically require dozens or hundreds of examples.
Finally, we built and open-sourced tinygemm, a fast GPU library that runs any4 and other quantized models efficiently. Our code is freely available at: https://github.com/facebookresearch/any4.
Link To Code: https://github.com/facebookresearch/any4
Primary Area: Deep Learning->Large Language Models
Keywords: Quantization, Compression, Acceleration
Submission Number: 4181