Differentiable, Stable and Efficient Floating-Point Quantization

ICLR 2026 Conference Submission15379 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Quantization
Abstract: Finding an optimal datatype for neural networks is a non-trivial problem with an exponential search space. To address quantization effectively, we consider pseudo-quantization training (PQT) on microscaling (MX) datatypes. Specifically, we propose pseudo-quantization noise (PQN) based on $R\approx\lfloor\mathcal N(0,1)/2\rceil$. This allows PQT to (1) optimize over the floating-point (FP) bit configuration, (2) preserve the dynamic range of the original data, and (3) generate the noise $R$ efficiently. We demonstrate that the proposed method enables stable and efficient pre-training of GPT2 and Llama2 language models with up to 1 billion (B) parameters for up to 295B tokens, with insights on optimal FP datatypes for model parameters.
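The abstract only specifies the noise term $R\approx\lfloor\mathcal N(0,1)/2\rceil$, not the full PQT procedure, so the following is a minimal sketch under stated assumptions: it samples $R$ by rounding half of a standard normal draw and perturbs a tensor by a hypothetical quantization step size. The names `sample_pqn`, `apply_pqn`, and `step` are illustrative assumptions, not the authors' implementation.

```python
import torch

def sample_pqn(shape, generator=None):
    """Draw R = round(N(0, 1) / 2): mostly 0, occasionally +/-1 or beyond."""
    return torch.round(torch.randn(shape, generator=generator) / 2)

def apply_pqn(x, step):
    """Hypothetical pseudo-quantization: perturb x by one step size times R."""
    return x + step * sample_pqn(x.shape)

if __name__ == "__main__":
    x = torch.randn(4, 8)
    step = 2 ** -3  # assumed per-block quantization step, for illustration only
    print(apply_pqn(x, step))
```

Because $|\mathcal N(0,1)/2| < 0.5$ with probability of roughly 0.68, most sampled $R$ values are zero, so the perturbation is sparse while still injecting integer multiples of the step size, which is consistent with the rounding-based noise the abstract describes.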
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 15379