Abstract: Post-training quantization (PTQ) is a promising approach to reducing the storage and computational requirements of large language models (LLMs) without additional training cost. Recent PTQ studies have primarily focused on quantizing only weights to sub-$8$-bit precision while keeping activations at $8$ bits or higher. Accurate sub-$8$-bit quantization of both weights and activations without relying on quantization-aware training remains a significant challenge. We propose a novel quantization method called block clustered quantization (BCQ), wherein each operand tensor is decomposed into blocks (a block is a group of contiguous scalars), blocks are clustered based on their statistics, and a dedicated optimal quantization codebook is designed for each cluster. As a specific embodiment of this approach, we propose a PTQ algorithm called Locally-Optimal BCQ (LO-BCQ) that iterates between the steps of block clustering and codebook design to greedily minimize the quantization mean squared error. When weight and activation scalars are encoded to W4A4 format (with $0.5$ bits of overhead for storing scaling factors and codebook selectors), we advance the current state-of-the-art by demonstrating $<1$\% loss in inference accuracy across several LLMs and downstream tasks.
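To make the block clustering / codebook design alternation described in the abstract concrete, the following is a minimal NumPy sketch of a BCQ-style iteration. It is not the authors' implementation: the block size, number of clusters, codebook size, initialization by block maximum magnitude, and the omission of per-block scaling factors and codebook-selector encoding are all illustrative assumptions; the function names (`fit_codebook`, `quantize_blocks`, `lo_bcq_sketch`) are hypothetical.

```python
# Hedged sketch of a block clustered quantization (BCQ) alternation:
# (1) fit a scalar codebook per cluster (Lloyd-style 1-D k-means),
# (2) reassign each block to the codebook giving the lowest MSE,
# and repeat. Parameters and helpers are illustrative, not the paper's.
import numpy as np


def fit_codebook(values, num_levels=16, iters=10):
    """Lloyd-Max style scalar codebook (1-D k-means) over a cluster's scalars."""
    # Initialize code levels at evenly spaced quantiles of the data.
    levels = np.quantile(values, np.linspace(0.0, 1.0, num_levels))
    for _ in range(iters):
        idx = np.argmin(np.abs(values[:, None] - levels[None, :]), axis=1)
        for k in range(num_levels):
            if np.any(idx == k):
                levels[k] = values[idx == k].mean()
    return levels


def quantize_blocks(blocks, levels):
    """Map every scalar in each block to its nearest codebook level."""
    idx = np.argmin(np.abs(blocks[..., None] - levels), axis=-1)
    return levels[idx]


def lo_bcq_sketch(tensor, block_size=64, num_clusters=4,
                  num_levels=16, outer_iters=5):
    """Alternate codebook design and block clustering to reduce quantization MSE."""
    blocks = tensor.reshape(-1, block_size)
    # Initial clustering by a simple block statistic (max magnitude quantiles).
    stats = np.abs(blocks).max(axis=1)
    edges = np.quantile(stats, np.linspace(0, 1, num_clusters + 1)[1:-1])
    assign = np.digitize(stats, edges)
    for _ in range(outer_iters):
        # Codebook design: one scalar codebook per (non-empty) cluster.
        codebooks = [fit_codebook(blocks[assign == c].ravel(), num_levels)
                     if np.any(assign == c) else np.zeros(num_levels)
                     for c in range(num_clusters)]
        # Block clustering: reassign each block to its lowest-MSE codebook.
        errs = [((blocks - quantize_blocks(blocks, levels)) ** 2).mean(axis=1)
                for levels in codebooks]
        assign = np.argmin(np.stack(errs, axis=1), axis=1)
    quantized = np.stack([quantize_blocks(b, codebooks[c])
                          for b, c in zip(blocks, assign)]).reshape(tensor.shape)
    return quantized, assign, codebooks
```

With `num_levels=16` each scalar maps to a 4-bit index (the W4A4 setting in the abstract), while the per-block cluster index plays the role of the codebook selector; the abstract's 0.5-bit overhead for scaling factors and selectors is not modeled in this sketch.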
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yunhe_Wang1
Submission Number: 5018