Keywords: Model Efficiency, Model Compression
Abstract: Post-training quantization (PTQ) has emerged as a promising solution for reducing the memory and computation overhead of large language models (LLMs), enabling efficient deployment without requiring full model retraining. However, existing PTQ methods struggle with joint weight–activation quantization and extreme low-bit weight quantization. The main challenge stems from the depth and cross-layer dependencies of LLMs, which cause quantization errors to propagate and accumulate across layers, leading to degraded performance. In this paper, we present I$^2$BQ, a simple yet effective framework that simultaneously addresses joint weight–activation quantization and extreme low-bit weight quantization. We first propose a granular quantization strategy that treats self-attention and feed-forward network (FFN) modules as separate quantization units with module-specific optimization objectives. To mitigate inter-layer error accumulation, we introduce an inter-block quantization strategy that explicitly accounts for cross-layer dependencies by encouraging consistency between blocks. Extensive experiments across diverse LLMs, including OPT and the LLaMA family, demonstrate that I$^2$BQ achieves superior performance under both W4A4 and highly aggressive W2 settings, while incurring negligible additional computational overhead.
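To make the two strategies described above concrete, the following is a minimal, hypothetical PyTorch sketch, not the authors' implementation: `fake_quantize`, `module_reconstruction_loss`, and `inter_block_consistency` are illustrative helpers I introduce here, and the toy `nn.Linear` layers merely stand in for the attention and FFN sub-modules of one transformer block. It only illustrates the general idea of quantizing each module as a separate unit with its own reconstruction objective, plus a consistency term between the full-precision and quantized block outputs.

```python
# Hypothetical sketch (not the paper's code): per-module weight fake-quantization
# with a module-specific reconstruction loss, plus an inter-block consistency term
# comparing the quantized block's output against the full-precision block's output.

import torch
import torch.nn as nn


def fake_quantize(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Uniform symmetric fake quantization with a single per-tensor scale."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale


def module_reconstruction_loss(fp_module: nn.Module,
                               q_module: nn.Module,
                               calib_x: torch.Tensor) -> torch.Tensor:
    """Module-specific objective: match the quantized module's output to the
    full-precision module's output on calibration inputs."""
    with torch.no_grad():
        target = fp_module(calib_x)
    return torch.mean((q_module(calib_x) - target) ** 2)


def inter_block_consistency(fp_block_out: torch.Tensor,
                            q_block_out: torch.Tensor) -> torch.Tensor:
    """Consistency term between full-precision and quantized block outputs, so
    quantization error does not silently accumulate into later blocks."""
    return torch.mean((q_block_out - fp_block_out) ** 2)


if __name__ == "__main__":
    torch.manual_seed(0)
    calib_x = torch.randn(8, 16)

    # Toy stand-ins for the attention and FFN sub-modules of one transformer block.
    fp_attn, fp_ffn = nn.Linear(16, 16), nn.Linear(16, 16)
    q_attn, q_ffn = nn.Linear(16, 16), nn.Linear(16, 16)
    q_attn.load_state_dict(fp_attn.state_dict())
    q_ffn.load_state_dict(fp_ffn.state_dict())

    # Quantize each sub-module's weights separately (granular quantization units).
    with torch.no_grad():
        q_attn.weight.copy_(fake_quantize(q_attn.weight, n_bits=4))
        q_ffn.weight.copy_(fake_quantize(q_ffn.weight, n_bits=4))

    loss_attn = module_reconstruction_loss(fp_attn, q_attn, calib_x)
    loss_ffn = module_reconstruction_loss(fp_ffn, q_ffn, calib_x)
    loss_block = inter_block_consistency(fp_ffn(fp_attn(calib_x)),
                                         q_ffn(q_attn(calib_x)))
    print(loss_attn.item(), loss_ffn.item(), loss_block.item())
```

In a real PTQ pipeline these losses would be minimized over calibration data by adjusting quantization parameters (and possibly the weights themselves); the sketch only evaluates them once to show where each term enters.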
Primary Area: optimization
Submission Number: 9613