Keywords: Low-Precision Quantization; INT vs. FP; Microscaling format;
TL;DR: This paper demonstrates that integer (INT) quantization surpasses floating-point (FP) quantization for LLMs in fine-grained settings, challenging the current hardware trend.
Abstract: Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, INT consistently surpasses it as the quantization block size shrinks. Our comprehensive comparison demonstrates that for popular fine-grained formats like MX (block size 32), MXINT8 and MXINT4 are superior to their FP counterparts in both algorithmic accuracy and hardware efficiency. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory and advocate for prioritizing fine-grained INT formats in future AI accelerators to achieve a better balance of accuracy, power, and efficiency.
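To make the fine-grained INT setting concrete, below is a minimal sketch of MXINT8-style block quantization with symmetric clipping, assuming a shared power-of-two scale per 32-element block and a clipping range of [-127, 127]; the function names and details are illustrative, not the authors' implementation.

```python
import numpy as np

def mxint8_quantize(x: np.ndarray, block_size: int = 32):
    """Illustrative MXINT8-style block quantization (hypothetical helper).

    Each block of `block_size` values shares a power-of-two scale; elements
    are stored as INT8 with a symmetric range [-127, 127], dropping the
    extra negative code -128 that makes the INT8 range asymmetric.
    """
    x = x.reshape(-1, block_size)
    # Shared per-block scale: smallest power of two covering the block's max magnitude.
    max_abs = np.abs(x).max(axis=1, keepdims=True)
    max_abs = np.where(max_abs == 0, 1.0, max_abs)
    scale = 2.0 ** np.ceil(np.log2(max_abs / 127.0))
    # Symmetric clipping: use [-127, 127] rather than the full [-128, 127] INT8 range.
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def mxint8_dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Reconstruct approximate values from the INT8 codes and per-block scales.
    return q.astype(np.float32) * scale

# Usage: quantize a random activation tensor and check reconstruction error.
x = np.random.randn(4, 32).astype(np.float32)
q, s = mxint8_quantize(x.flatten())
x_hat = mxint8_dequantize(q, s).reshape(x.shape)
print("max abs error:", np.abs(x - x_hat).max())
```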
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 17853