Why Do We Need Warm-up? A Theoretical Perspective

15 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Optimization, $(H_0, H_1)$-smoothness, warm-up, explanation for warm-up
TL;DR: We propose a smoothness condition that explains learning-rate warm-up. We show, theoretically and empirically, that it holds for standard architectures and approximates early-training smoothness. We also prove convergence guarantees and confirm empirical gains.
Abstract: Learning-rate warm-up, i.e., increasing the step size at the beginning of training, has become a ubiquitous heuristic in modern deep learning, yet its theoretical foundations remain poorly understood. In this work, we provide a principled explanation for why warm-up improves training. We rely on a generalization of the $(L_0, L_1)$-smoothness condition, which bounds local curvature as a linear function of the loss sub-optimality and exhibits desirable closure properties. We demonstrate both theoretically and empirically that this condition holds for common neural architectures trained with mean-squared error and cross-entropy losses. Under this assumption, we prove that Gradient Descent with a warm-up schedule achieves faster convergence than with a fixed step size, establishing upper and lower complexity bounds. Finally, we validate our theoretical insights through experiments on language and vision models, confirming the practical benefits of warm-up schedules.
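The page does not state the condition formally; as a rough sketch, assuming the $(H_0, H_1)$ notation from the keywords and that the generalization replaces the gradient-norm term of $(L_0, L_1)$-smoothness, $\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|$, with the loss sub-optimality described in the abstract, the condition would read

$$\|\nabla^2 f(x)\| \;\le\; H_0 + H_1\,\bigl(f(x) - f^*\bigr), \qquad f^* = \inf_x f(x).$$

Under a bound of this form, the admissible Gradient Descent step size scales roughly as $\eta_t \approx 1/\bigl(H_0 + H_1 (f(x_t) - f^*)\bigr)$: it is small while the loss is far from optimal and grows as training progresses, which matches a warm-up schedule.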
Supplementary Material: zip
Primary Area: optimization
Submission Number: 5931