Abstract: Post-training quantization (PTQ) is a promising approach to reducing the storage and computational requirements of large language models (LLMs) without additional training cost. Recent PTQ studies have primarily focused on quantizing only weights to sub-$8$-bit precision while keeping activations at $8$ bits or higher. Accurate sub-$8$-bit quantization of both weights and activations without relying on quantization-aware training remains a significant challenge. We propose a novel quantization method called block clustered quantization (BCQ), wherein each operand tensor is decomposed into blocks (a block is a group of contiguous scalars), blocks are clustered based on their statistics, and a dedicated optimal quantization codebook is designed for each cluster. As a specific embodiment of this approach, we propose a PTQ algorithm called Locally-Optimal BCQ (LO-BCQ) that iterates between the steps of block clustering and codebook design to greedily minimize the quantization mean squared error. When weight and activation scalars are encoded to W4A4 format (with $0.5$ bits of overhead for storing scaling factors and codebook selectors), we advance the current state-of-the-art by demonstrating $<1$\% loss in inference accuracy across several LLMs and downstream tasks.
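To make the block clustering / codebook design alternation described in the abstract concrete, the following is a minimal NumPy sketch of a BCQ-style iteration. It is not the authors' implementation: the block size, number of clusters, codebook size, initialization by block maximum magnitude, and the omission of per-block scaling factors and codebook-selector encoding are all illustrative assumptions; the function names (`fit_codebook`, `quantize_blocks`, `lo_bcq_sketch`) are hypothetical.

```python
# Hedged sketch of a block clustered quantization (BCQ) alternation:
# (1) fit a scalar codebook per cluster (Lloyd-style 1-D k-means),
# (2) reassign each block to the codebook giving the lowest MSE,
# and repeat. Parameters and helpers are illustrative, not the paper's.
import numpy as np


def fit_codebook(values, num_levels=16, iters=10):
    """Lloyd-Max style scalar codebook (1-D k-means) over a cluster's scalars."""
    # Initialize code levels at evenly spaced quantiles of the data.
    levels = np.quantile(values, np.linspace(0.0, 1.0, num_levels))
    for _ in range(iters):
        idx = np.argmin(np.abs(values[:, None] - levels[None, :]), axis=1)
        for k in range(num_levels):
            if np.any(idx == k):
                levels[k] = values[idx == k].mean()
    return levels


def quantize_blocks(blocks, levels):
    """Map every scalar in each block to its nearest codebook level."""
    idx = np.argmin(np.abs(blocks[..., None] - levels), axis=-1)
    return levels[idx]


def lo_bcq_sketch(tensor, block_size=64, num_clusters=4,
                  num_levels=16, outer_iters=5):
    """Alternate codebook design and block clustering to reduce quantization MSE."""
    blocks = tensor.reshape(-1, block_size)
    # Initial clustering by a simple block statistic (max magnitude quantiles).
    stats = np.abs(blocks).max(axis=1)
    edges = np.quantile(stats, np.linspace(0, 1, num_clusters + 1)[1:-1])
    assign = np.digitize(stats, edges)
    for _ in range(outer_iters):
        # Codebook design: one scalar codebook per (non-empty) cluster.
        codebooks = [fit_codebook(blocks[assign == c].ravel(), num_levels)
                     if np.any(assign == c) else np.zeros(num_levels)
                     for c in range(num_clusters)]
        # Block clustering: reassign each block to its lowest-MSE codebook.
        errs = [((blocks - quantize_blocks(blocks, levels)) ** 2).mean(axis=1)
                for levels in codebooks]
        assign = np.argmin(np.stack(errs, axis=1), axis=1)
    quantized = np.stack([quantize_blocks(b, codebooks[c])
                          for b, c in zip(blocks, assign)]).reshape(tensor.shape)
    return quantized, assign, codebooks
```

With `num_levels=16` each scalar maps to a 4-bit index (the W4A4 setting in the abstract), while the per-block cluster index plays the role of the codebook selector; the abstract's 0.5-bit overhead for scaling factors and selectors is not modeled in this sketch.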
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yunhe_Wang1
Submission Number: 5018