GPTVQ: The Blessing of Dimensionality for LLM Quantization

Published: 21 Jun 2024, Last Modified: 24 Jul 2024
Venue: ES-FoMo-II 2024 Poster
License: CC BY 4.0
Keywords: LLM quantization, post-training quantization, vector quantization
TL;DR: A fast method for post-training quantization of LLMs using vector quantization
Abstract: Large language models (LLMs) incur a large DRAM footprint and high memory bandwidth costs, severely limiting deployment on mobile devices. This work demonstrates that non-uniform quantization in one or more dimensions can significantly ease this memory bottleneck. We provide analysis and experimental results showing that the model size versus accuracy trade-off of neural network quantization improves markedly as the quantization dimensionality increases. To exploit this, we propose GPTVQ, an efficient method that extends GPTQ to non-uniform and vector quantization (VQ). GPTVQ establishes state-of-the-art results in model size versus accuracy across a wide range of LLMs, including Llama-v2/v3 and Mistral. Furthermore, our method is fast: on a single H100 it takes between 3 and 11 hours to process Llama-v2-70B. Finally, we show that VQ is practical by demonstrating a simultaneous reduction in DRAM footprint and latency for a VQ-quantized LLM on a mobile-class Arm CPU and a desktop Nvidia GPU.
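To make the vector-quantized weight representation referenced in the abstract concrete, here is a minimal Python sketch. It is our own illustration, not the GPTVQ procedure (which extends GPTQ and therefore also uses calibration data and second-order information): weights are grouped into d-dimensional vectors, a small codebook is fit with plain k-means, and each group is stored as a codebook index, so the per-weight cost is roughly bits/d plus the codebook overhead. All function names and parameters below are illustrative assumptions.

```python
# Illustrative vector quantization of a weight matrix (not the GPTVQ algorithm).
import numpy as np

def vq_quantize(weights: np.ndarray, dim: int = 2, bits: int = 8, iters: int = 20):
    """Fit a 2**bits-entry codebook of dim-dimensional vectors with k-means."""
    flat = weights.reshape(-1, dim)                      # group weights into d-dim vectors
    k = 2 ** bits
    rng = np.random.default_rng(0)
    codebook = flat[rng.choice(len(flat), k, replace=False)]  # init codebook from data
    for _ in range(iters):                               # plain Lloyd's k-means
        dists = ((flat[:, None, :] - codebook[None]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for c in range(k):
            members = flat[assign == c]
            if len(members):
                codebook[c] = members.mean(0)
    return assign.astype(np.uint32), codebook            # indices + codebook

def vq_dequantize(indices: np.ndarray, codebook: np.ndarray, shape) -> np.ndarray:
    """Reconstruct the weight matrix by codebook lookup."""
    return codebook[indices].reshape(shape)

# Usage on a toy 256x256 layer: 8-bit indices over 2-D vectors ~ 4 bits per weight.
W = np.random.randn(256, 256).astype(np.float32)
idx, cb = vq_quantize(W, dim=2, bits=8)
W_hat = vq_dequantize(idx, cb, W.shape)
print("bits per weight ~", 8 / 2, "| reconstruction MSE:", float(((W - W_hat) ** 2).mean()))
```

The sketch only shows why higher quantization dimensionality helps the size/accuracy trade-off: the codebook can follow the joint distribution of weight groups rather than quantizing each scalar on a uniform grid.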
Supplementary Material: zip
Submission Number: 37