BCQ: Block Clustered Quantization for 4-bit (W4A4) LLM Inference

TMLR Paper 5018 Authors

03 Jun 2025 (modified: 28 Aug 2025) · Under review for TMLR · CC BY 4.0
Abstract: Post-training quantization (PTQ) is a promising approach to reducing the storage and computational requirements of large language models (LLMs) without additional training cost. Recent PTQ studies have primarily focused on quantizing only weights to sub-$8$-bits while maintaining activations at $8$-bits or higher. Accurate sub-8-bit quantization for both weights and activations without relying on quantization-aware training remains a significant challenge. We propose a novel quantization method called block clustered quantization (BCQ) wherein each operand tensor is decomposed into blocks (a block is a group of contiguous scalars), blocks are clustered based on their statistics, and a dedicated optimal quantization codebook is designed for each cluster. As a specific embodiment of this approach, we propose a PTQ algorithm called Locally-Optimal BCQ (LO-BCQ) that iterates between the steps of block clustering and codebook design to greedily minimize the quantization mean squared error. When weight and activation scalars are encoded to W4A4 format (with $0.5$-bits of overhead for storing scaling factors and codebook selectors), we advance the current state-of-the-art by demonstrating $<1$\% loss in inference accuracy across several LLMs and downstream tasks.
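The abstract describes LO-BCQ as an alternation between block clustering and per-cluster codebook design that greedily minimizes quantization MSE. The following is a minimal NumPy sketch of that alternation under stated assumptions: the function names, the Lloyd-style codebook fit, and the quantile initialization are illustrative choices, not the authors' implementation, and the sketch omits the per-block scale factors and the 4-bit encoding constraints used in the paper.

```python
import numpy as np

def design_codebook(scalars, n_levels=16, n_iters=20):
    # Lloyd-style fit (an assumption, not the paper's exact procedure):
    # alternate nearest-level assignment and centroid update.
    levels = np.quantile(scalars, np.linspace(0.0, 1.0, n_levels))
    for _ in range(n_iters):
        idx = np.abs(scalars[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(n_levels):
            members = scalars[idx == k]
            if members.size:
                levels[k] = members.mean()
    return levels

def block_mse(block, levels):
    # Squared error of quantizing one block with a given codebook.
    q = levels[np.abs(block[:, None] - levels[None, :]).argmin(axis=1)]
    return float(((block - q) ** 2).sum())

def lo_bcq(blocks, n_codebooks=4, n_levels=16, n_rounds=5, seed=0):
    # Alternate (1) codebook design per cluster and (2) reassignment of
    # each block to its lowest-MSE codebook, tracking total MSE per round.
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, n_codebooks, size=len(blocks))
    mse_hist = []
    for _ in range(n_rounds):
        books = []
        for c in range(n_codebooks):
            pooled = np.concatenate(
                [b for b, a in zip(blocks, assign) if a == c]
            ) if (assign == c).any() else np.concatenate(blocks)
            books.append(design_codebook(pooled, n_levels))
        errs = np.array([[block_mse(b, cb) for cb in books] for b in blocks])
        assign = errs.argmin(axis=1)
        mse_hist.append(float(errs.min(axis=1).sum()))
    return assign, books, mse_hist

# Toy demo: blocks drawn at two very different scales, which the
# clustering step can serve with two separate codebooks.
rng = np.random.default_rng(1)
blocks = [rng.normal(0.0, 1.0, 32) for _ in range(20)]
blocks += [rng.normal(0.0, 10.0, 32) for _ in range(20)]
assign, books, mse_hist = lo_bcq(blocks, n_codebooks=2, n_levels=16)
```

The design choice mirrored here is the one the abstract emphasizes: blocks with similar statistics share a codebook, so a small per-block selector (part of the stated 0.5-bit overhead) suffices to pick the right codebook at inference time.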
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
- Table 1: lists the various LO-BCQ configurations.
- Table 9: compares the perplexity achieved by LO-BCQ codebooks with INT4, INT6, and INT8 entries on the Llama2-7B model and the WikiText-103 dataset.
- Section 2.1: clarified the bit-widths of per-block-array scale factors and codebook entries, per the reviewer's suggestion.
- Section 2.4: added a discussion of the various LO-BCQ configurations and a reference to our ablation study, which details the impact of parameter selection (block size, number of codebooks, block-array size, and bit-width of codebook entries).
- Section 4.3: added a discussion of the choice of bit-width for LO-BCQ codebook entries.
Assigned Action Editor: ~Yunhe_Wang1
Submission Number: 5018