MXNorm: Reusing block scales for efficient tensor normalisation

Published: 30 Oct 2025 · Last Modified: 04 Nov 2025 · MLForSys 2025 · CC BY 4.0
Keywords: efficient pretraining, pretraining, quantization, mxfp, normalization, llm
TL;DR: We approximate the RMS of a tensor from the MX block scales to normalise tensors during pretraining.
Abstract: The matrix multiplications that comprise the bulk of computation in deep learning are being performed in increasingly narrow-precision formats. For example, next-generation AI accelerators support dot products in MXFP4, a format requiring only 4.25 bits per element. However, accelerator performance for low-precision matrix multiplication far outstrips performance on the reductions and elementwise computations that are still carried out in higher precision. In this work, we reduce the cost of normalising tensors by approximating the RMSNorm of an MXFP tensor using only the MX block scales, enabling a 32x decrease in the size of the reductions needed for normalisation. We validate our approximation on pre-training of Llama-3 models with 250M and 1B parameters, finding minimal loss of training accuracy compared to a baseline using RMSNorm with MXFP8 matmuls.
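The idea in the abstract can be illustrated with a minimal sketch. MXFP formats store one shared scale per block of 32 elements, so the root-mean-square of a tensor can be estimated from the block scales alone, a reduction 32x smaller than one over all elements. The function names, the calibration constant `c`, and the NumPy layout below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

BLOCK_SIZE = 32  # MXFP block size: one shared scale per 32 elements


def approx_rms_from_mx_scales(block_scales: np.ndarray, c: float = 1.0) -> np.ndarray:
    """Estimate the RMS of an MX-formatted tensor from its block scales only.

    If in-block element magnitudes track the shared block scale, the tensor RMS
    is roughly proportional to the RMS of the scales. `c` is a hypothetical
    calibration constant absorbing the average in-block magnitude.
    The reduction here runs over num_blocks values instead of
    num_blocks * BLOCK_SIZE elements.
    """
    s = block_scales.astype(np.float32)
    return c * np.sqrt(np.mean(s * s, axis=-1))


def approx_rmsnorm(x: np.ndarray, block_scales: np.ndarray,
                   gain: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm-style normalisation whose denominator is the scale-based RMS estimate.

    x:            dequantised activations, shape (..., hidden)
    block_scales: per-block scales,         shape (..., hidden // BLOCK_SIZE)
    gain:         learned per-channel gain, shape (hidden,)
    """
    rms = approx_rms_from_mx_scales(block_scales)[..., None]  # broadcast over hidden dim
    return gain * x / (rms + eps)


if __name__ == "__main__":
    # Toy check: compare the scale-based estimate against the exact RMS.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 1024)).astype(np.float32)
    # Crude stand-in for MX block scales: max magnitude per block of 32 elements.
    scales = np.abs(x.reshape(4, -1, BLOCK_SIZE)).max(axis=-1)
    exact_rms = np.sqrt(np.mean(x * x, axis=-1))
    approx = approx_rms_from_mx_scales(scales, c=0.4)  # c chosen by eye for this toy
    print(exact_rms, approx)
```

The toy main block only shows that the two reductions have comparable shape and cost; how the calibration constant is chosen, and how the scales are produced during MXFP quantisation, is not specified on this page.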
Submission Number: 21