Keywords: quantization, llms, trellises, fast inference, post training quantization, trellis coded quantization, model compression, computed codes
TL;DR: We present the first tractable ultra-high-dimensional quantizer for LLM PTQ that supports fast inference, achieving state-of-the-art quantization quality and inference speed.
Abstract: Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing weights to low-precision datatypes.
Since LLM inference is usually memory-bound, PTQ methods can improve inference throughput.
Recent state-of-the-art PTQ approaches use vector quantization (VQ) to quantize multiple weights at once, which improves information utilization through better shaping.
However, VQ requires a codebook with size exponential in the dimension.
This limits current VQ-based PTQ works to low VQ dimensions ($\le 8$) that in turn limit quantization quality.
Here, we introduce QTIP, which instead uses trellis coded quantization (TCQ) to achieve ultra-high-dimensional quantization.
TCQ uses a stateful decoder that separates the codebook size from the bitrate and effective dimension.
QTIP introduces a spectrum of lookup-only to computed lookup-free trellis codes designed for a hardware-efficient "bitshift" trellis structure; these codes achieve state-of-the-art results in both quantization quality and inference speed.
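To make the "bitshift" trellis idea concrete, below is a minimal decoding sketch, not the paper's implementation: the decoder state is an L-bit sliding window over the compressed bitstream, so advancing it is just a shift-and-mask, and each step reads k new bits while emitting one value. The parameter names (L, k), the codebook, and the multiplicative-hash "computed" variant are illustrative assumptions, not details from the paper.

```python
def bitshift_trellis_decode(bits, L=16, k=2, codebook=None):
    """Decode a list of 0/1 bits into values, k bits per emitted value.

    The state is the last L bits seen; consecutive states overlap in
    L - k bits, which is why the codebook size (2^L entries) is
    decoupled from the bitrate (k bits per value).
    """
    mask = (1 << L) - 1
    state = 0
    # Warm up: fill the L-bit state window from the stream's first L bits.
    for b in bits[:L]:
        state = ((state << 1) | b) & mask
    out = []
    pos = L
    while True:
        if codebook is not None:
            # Lookup-only variant: the L-bit state indexes a codebook.
            out.append(codebook[state])
        else:
            # Lookup-free "computed" variant: derive a value from the
            # state with a cheap multiplicative hash (illustrative only).
            out.append(((state * 2654435761) & mask) / mask - 0.5)
        if pos + k > len(bits):
            break
        # Advance the trellis: shift in the next k bits.
        for b in bits[pos:pos + k]:
            state = ((state << 1) | b) & mask
        pos += k
    return out
```

In this sketch each additional decoded weight costs only k bits of the stream, while the state (and hence the effective codebook) spans 2^L entries, which is the separation of bitrate from codebook size that the abstract describes.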
Primary Area: Other (please use sparingly, only use the keyword field for more details)
Submission Number: 5532