Introducing Accurate 4-Bit Quantization with Hyperspherical Architecture

01 Sept 2025 (modified: 17 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: LLM, quantization, hyperspherical transformer
Abstract: With hardware support from NVIDIA's Blackwell architecture, 4-bit quantization of large language models promises substantial memory and throughput gains. However, naive 4-bit quantization degrades accuracy and remains challenging in practice. In this work, we revisit the root causes of this degradation and offer a new perspective through an analysis of matrix multiplication and the unbounded weights within models. We show that quantization induces errors that are amplified within the attention and MLP submodules, leading to unstable error growth across layers. Building on this analysis, we propose an architectural co-design that uses hyperspherical transformers to jointly normalize activations and constrain weights to unit norm, converting dot products into bounded cosine similarities and suppressing error expansion. On 0.5–1B models, pretrained hyperspherical models set a new state of the art under 4-bit weight-activation quantization, outperforming the standard transformer architecture and a strong QAT baseline, while a partial-normalization plug-in narrows the degradation gap in existing models. These results position architectural co-design as a third optimization axis, complementary to existing methods, for robust low-bit LLM deployment.
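To make the core mechanism concrete, below is a minimal sketch (not the authors' implementation) of the idea the abstract describes: normalizing activations and constraining weight rows to unit norm so each output entry becomes a bounded cosine similarity. The class name `HypersphericalLinear` and the learnable `scale` parameter are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HypersphericalLinear(nn.Module):
    """Linear layer whose outputs are (optionally scaled) cosine similarities."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        # Hypothetical learnable per-output scale; without it, outputs stay in [-1, 1].
        self.scale = nn.Parameter(torch.ones(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_hat = F.normalize(x, dim=-1)            # unit-norm activations
        w_hat = F.normalize(self.weight, dim=-1)  # unit-norm weight rows
        cos = x_hat @ w_hat.t()                   # bounded dot products (cosine similarities)
        return cos * self.scale


if __name__ == "__main__":
    layer = HypersphericalLinear(64, 128)
    out = layer(torch.randn(2, 16, 64))
    print(out.shape, out.abs().max().item())  # magnitudes bounded by |scale|
```

Because both operands are unit-norm, pre-scale outputs lie in [-1, 1], which is the bounded dynamic range that the abstract argues limits quantization-error amplification across layers.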
Primary Area: foundation or frontier models, including LLMs
Submission Number: 611