Pre-training Large Language Models with Dynamic Precision: Low-Cost Computation with High-Fidelity Performance
Keywords: mixed-precision training, dynamic precision strategy, efficient LLM pre-training, gradient norm.
TL;DR: This paper proposes a GNMR-driven dynamic precision method that stabilizes low-precision training and narrows its performance gap with high-precision strategies.
Abstract: Mixed-precision training (MPT) is widely employed in large-scale deep learning to balance computational efficiency and model performance. However, low-precision schemes, including 8-bit and 4-bit formats, face challenges such as architecture dependence, numerical spikes, and reliance on static precision strategies. To address these issues, this paper proposes a dynamic precision method driven by the Gradient NorM Ratio (GNMR) metric. The GNMR index, together with its variant, the -GNMR index, enables real-time identification of high-risk operators and steps during low-precision training and triggers recovery to high precision when risk is detected. We provide a theoretical analysis that validates the proposed dynamic precision strategy. Experimental results on various models demonstrate that the method effectively improves the stability of mixed-precision training and narrows the performance gap between low-precision (4-bit, 8-bit) schemes and popular high-precision training strategies such as BF16.
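The abstract does not give the precise form of GNMR, so the following is only a minimal sketch of the general idea: it assumes, hypothetically, that the metric compares the current global gradient norm to a running reference, and that a spike in this ratio flags a step that should fall back to high precision. The class name, threshold, and EMA smoothing are illustrative assumptions, not the authors' implementation.

import torch

class GNMRController:
    """Hypothetical sketch of a gradient-norm-ratio precision controller.

    The paper's exact GNMR definition is not given in the abstract;
    here the ratio compares the current global gradient norm to an
    exponential moving average of past norms (an assumed form).
    """

    def __init__(self, threshold=4.0, beta=0.9):
        self.threshold = threshold  # assumed spike threshold, not from the paper
        self.beta = beta            # assumed EMA smoothing factor
        self.ref = None             # running reference gradient norm

    def gnmr(self, model):
        # Global L2 gradient norm, accumulated on the host as a Python float.
        total = 0.0
        for p in model.parameters():
            if p.grad is not None:
                total += float(p.grad.detach().float().norm()) ** 2
        norm = total ** 0.5
        if self.ref is None:
            self.ref = norm
        ratio = norm / max(self.ref, 1e-12)  # GNMR-style ratio (assumed form)
        self.ref = self.beta * self.ref + (1.0 - self.beta) * norm
        return ratio

    def use_high_precision(self, model):
        # Recover to high precision for the next step when the ratio spikes.
        return self.gnmr(model) > self.threshold

if __name__ == "__main__":
    # Tiny demo: one backward pass, then query the controller.
    model = torch.nn.Linear(16, 16)
    model(torch.randn(4, 16)).sum().backward()
    ctrl = GNMRController()
    print("fall back to high precision:", ctrl.use_high_precision(model))

In an actual training loop, the controller would be queried after backward() and a flagged step would run its matrix multiplications under a higher-precision path (e.g., BF16) instead of the low-precision kernels; genuine 8-bit or 4-bit execution depends on specialized kernels that this sketch does not model.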
Submission Number: 75