Keywords: Low-bit, FP4, Training, Efficiency, Scaling
Abstract: Training large language models (LLMs) at 4-bit precision offers substantial efficiency gains but remains challenging due to FP4's limited dynamic range and coarse numerical resolution. Existing 4-bit training pipelines typically rely on max-scaling, which is ill-suited to the heavy-tailed distributions of LLM tensors and leads to severe under-utilization of the FP4 quantization grid in the low-magnitude region. This causes pronounced \emph{representation collapse} and large rounding errors for precisely the values that dominate LLM computation. In this work, we derive the theoretically optimal scaling for FP4 under heavy-tailed inputs, revealing why max-scaling is intrinsically suboptimal. Guided by this analysis, we propose \textbf{Half-S}, an efficient scaling strategy that bridges theory and practice: it halves the max-based scale, via a simple exponent shift, when this is beneficial, and safely falls back to max-scaling otherwise, achieving theoretically near-optimal scaling under realistic LLM statistics. Extensive experiments on large-scale pretraining and downstream fine-tuning show that Half-S matches BF16 convergence and final model quality while delivering up to \textbf{1.8$\times$} end-to-end training speedup. These results establish Half-S as a minimal yet fundamental correction that, for the first time, enables practical and near-lossless 4-bit LLM training.
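The abstract's core idea, halving the max-based scale so the low-magnitude bulk of a heavy-tailed tensor lands on finer FP4 grid points, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the E2M1 grid, the Student-t stand-in for an LLM tensor, and the `quantize_fp4` helper are all illustrative assumptions.

```python
import numpy as np

# Representable magnitudes of FP4 (E2M1), a common choice in 4-bit training work.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x, scale):
    """Divide by the scale, clip to the FP4 range, round to the nearest
    grid magnitude, and rescale back (illustrative helper, not the paper's kernel)."""
    y = np.clip(np.abs(x) / scale, 0.0, FP4_GRID[-1])
    idx = np.abs(y[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(x) * FP4_GRID[idx] * scale

rng = np.random.default_rng(0)
x = rng.standard_t(df=3, size=4096)      # heavy-tailed stand-in for an LLM tensor

s_max = np.abs(x).max() / FP4_GRID[-1]   # standard max-scaling: tensor max maps to 6.0
s_half = s_max / 2.0                     # "Half-S"-style halved scale (one exponent shift)

err_max = np.mean((x - quantize_fp4(x, s_max)) ** 2)
err_half = np.mean((x - quantize_fp4(x, s_half)) ** 2)
print(f"MSE with max-scaling:  {err_max:.4f}")
print(f"MSE with halved scale: {err_half:.4f}")
```

Halving the scale clips the few extreme outliers but halves the effective grid spacing for the low-magnitude values that dominate the tensor, which is why, under heavy tails, the overall rounding error can drop. Because the scale is a power-of-two adjustment, halving it amounts to decrementing the shared exponent, consistent with the "simple exponent shift" described in the abstract.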
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Efficient/Low-Resource Methods for NLP
Contribution Types: Approaches for low compute settings-efficiency
Languages Studied: English
Submission Number: 1403