Quartet: Native FP4 Training Can Be Optimal for Large Language Models

Published: 11 Jun 2025, Last Modified: 10 Jul 2025 · ES-FoMo III Spotlight · CC BY 4.0
Keywords: quantization, training, floating-point formats
TL;DR: We present a method for accurate native FP4 training of large language models, together with an efficient GPU implementation.
Abstract: Training large language models (LLMs) directly in low precision offers a way to address computational costs by improving both throughput and energy efficiency. NVIDIA's recent Blackwell architecture facilitates very low-precision operations using FP4 variants for efficiency gains. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we investigate hardware-supported FP4 training and introduce Quartet, a new approach for accurate, end-to-end FP4 training with all the major computations (e.g., in linear layers) in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across bit-widths and training setups. Guided by it, we design an optimal technique in terms of accuracy-vs-computation, called Quartet. We implement Quartet using optimized CUDA kernels tailored for Blackwell, demonstrating that fully FP4-based training is a competitive alternative to standard-precision and FP8 training.
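To make the idea of running a linear layer's matmuls in FP4 concrete, below is a minimal NumPy sketch of "fake" FP4 (E2M1) quantization with per-block absmax scaling applied to both operands. The function name, block size, and scaling rule are illustrative assumptions for exposition only; this is not the Quartet algorithm or its Blackwell CUDA kernels, which operate on true hardware FP4.

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format (sign handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_fp4(x, block_size=32):
    """Simulate FP4 (E2M1) quantization with per-block absmax scaling.

    The tensor is flattened into blocks of `block_size`; each block is scaled
    so its largest magnitude maps to 6.0 (the largest E2M1 value), every
    element is rounded to the nearest representable FP4 value, and the block
    is rescaled back. Returns a float tensor containing only FP4-representable
    values times the per-block scale, i.e. a "fake-quantized" copy.
    """
    flat = x.reshape(-1, block_size).astype(np.float64)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 6.0
    scale[scale == 0] = 1.0                       # avoid division by zero
    scaled = flat / scale
    # Round each magnitude to the nearest E2M1 grid point, keep the sign.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    quant = np.sign(scaled) * E2M1_GRID[idx]
    return (quant * scale).reshape(x.shape).astype(x.dtype)

# Example: quantize both operands of a linear layer's forward matmul.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)   # weights
X = rng.standard_normal((8, 64)).astype(np.float32)    # activations
Y_fp4 = fake_quantize_fp4(X) @ fake_quantize_fp4(W).T
Y_ref = X @ W.T
print("relative error:", np.linalg.norm(Y_fp4 - Y_ref) / np.linalg.norm(Y_ref))
```

The same quantize-then-multiply pattern would apply to the backward-pass matmuls in an end-to-end low-precision setup; a real implementation stores packed 4-bit codes plus scales and performs the multiplication in hardware FP4 rather than rounding in floating point as this sketch does.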
Submission Number: 114