Keywords: Model Efficiency, Model Compression
Abstract: Post-training quantization (PTQ) has emerged as a promising solution for reducing the memory and computation overhead of large language models (LLMs), enabling efficient deployment without requiring full model retraining. However, existing PTQ methods struggle with joint weight–activation quantization and extreme low-bit weight quantization. The main challenge stems from the depth and cross-layer dependencies of LLMs, which cause quantization errors to propagate and accumulate across layers, leading to degraded performance. In this paper, we present I$^2$BQ, a simple yet effective framework that simultaneously addresses joint weight–activation quantization and extreme low-bit weight quantization. We first propose a granular quantization strategy that treats self-attention and feed-forward network (FFN) modules as separate quantization units with module-specific optimization objectives. To mitigate inter-layer error accumulation, we introduce an inter-block quantization strategy that explicitly accounts for cross-layer dependencies by encouraging consistency between blocks. Extensive experiments across diverse LLMs, including OPT and the LLaMA family, demonstrate that I$^2$BQ achieves superior performance under both W4A4 and highly aggressive W2 settings, while incurring negligible additional computational overhead.
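To make the two strategies described above concrete, the following is a minimal, hypothetical PyTorch sketch, not the authors' implementation: `fake_quantize`, `module_reconstruction_loss`, and `inter_block_consistency` are illustrative helpers I introduce here, and the toy `nn.Linear` layers merely stand in for the attention and FFN sub-modules of one transformer block. It only illustrates the general idea of quantizing each module as a separate unit with its own reconstruction objective, plus a consistency term between the full-precision and quantized block outputs.

```python
# Hypothetical sketch (not the paper's code): per-module weight fake-quantization
# with a module-specific reconstruction loss, plus an inter-block consistency term
# comparing the quantized block's output against the full-precision block's output.

import torch
import torch.nn as nn


def fake_quantize(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Uniform symmetric fake quantization with a single per-tensor scale."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale


def module_reconstruction_loss(fp_module: nn.Module,
                               q_module: nn.Module,
                               calib_x: torch.Tensor) -> torch.Tensor:
    """Module-specific objective: match the quantized module's output to the
    full-precision module's output on calibration inputs."""
    with torch.no_grad():
        target = fp_module(calib_x)
    return torch.mean((q_module(calib_x) - target) ** 2)


def inter_block_consistency(fp_block_out: torch.Tensor,
                            q_block_out: torch.Tensor) -> torch.Tensor:
    """Consistency term between full-precision and quantized block outputs, so
    quantization error does not silently accumulate into later blocks."""
    return torch.mean((q_block_out - fp_block_out) ** 2)


if __name__ == "__main__":
    torch.manual_seed(0)
    calib_x = torch.randn(8, 16)

    # Toy stand-ins for the attention and FFN sub-modules of one transformer block.
    fp_attn, fp_ffn = nn.Linear(16, 16), nn.Linear(16, 16)
    q_attn, q_ffn = nn.Linear(16, 16), nn.Linear(16, 16)
    q_attn.load_state_dict(fp_attn.state_dict())
    q_ffn.load_state_dict(fp_ffn.state_dict())

    # Quantize each sub-module's weights separately (granular quantization units).
    with torch.no_grad():
        q_attn.weight.copy_(fake_quantize(q_attn.weight, n_bits=4))
        q_ffn.weight.copy_(fake_quantize(q_ffn.weight, n_bits=4))

    loss_attn = module_reconstruction_loss(fp_attn, q_attn, calib_x)
    loss_ffn = module_reconstruction_loss(fp_ffn, q_ffn, calib_x)
    loss_block = inter_block_consistency(fp_ffn(fp_attn(calib_x)),
                                         q_ffn(q_attn(calib_x)))
    print(loss_attn.item(), loss_ffn.item(), loss_block.item())
```

In a real PTQ pipeline these losses would be minimized over calibration data by adjusting quantization parameters (and possibly the weights themselves); the sketch only evaluates them once to show where each term enters.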
Primary Area: optimization
Submission Number: 9613