Momentum Acceleration of Normalized Steepest Descent at the Edge of Stability

Published: 29 May 2026, Last Modified: 29 May 2026
Venue: HiLD at ICML 2026 Poster
License: CC BY 4.0
Keywords: normalized steepest descent, momentum, edge of stability, theory of optimization
TL;DR: We identify a previously unknown mechanism by which momentum accelerates NSD in full-batch oscillatory dynamics.
Abstract: Optimizers based on normalized steepest descent (NSD) with momentum have seen growing success in training large-scale language models. Despite their widespread empirical adoption, the role of momentum in the full-batch regime remains unclear. In this paper, we identify a novel mechanism by which momentum accelerates NSD in the oscillatory training regime where the loss value does not decrease monotonically. Specifically, momentum suppresses the oscillatory component in the momentum buffer, so the stable progress direction becomes dominant in the unit-norm update. Theoretically, we provide a rigorous justification for this mechanism using a two-dimensional quadratic objective that captures the essential features of the oscillatory dynamics. Our analysis also extends to the sign-based variant of NSD, where momentum is provably essential for making progress in the oscillatory regime. Empirically, we validate the theory with full-batch training of an MLP network, where momentum significantly improves the final loss and delays the onset of the Edge of Stability.
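Illustrative sketch (not taken from the paper): the mechanism described in the abstract can be reproduced on a toy problem. The objective f(x, y) = 0.5*h*x^2 - g*y, the constants h, g, eta, beta, the starting point, and the function name run_nsd are all assumptions chosen for illustration; the paper's own two-dimensional quadratic may differ. Here x plays the role of the sharp, oscillating direction and y the stable progress direction, and the update is the l2-normalized momentum buffer, so every step has length eta.

```python
import numpy as np

def run_nsd(beta, steps=200, eta=0.1, h=100.0, g=1.0, w0=(0.03, 0.0)):
    """l2-normalized steepest descent with EMA momentum on the toy objective
    f(x, y) = 0.5*h*x**2 - g*y (an assumed objective, not the paper's)."""
    w = np.array(w0, dtype=float)
    m = np.zeros(2)  # momentum buffer
    for _ in range(steps):
        # Gradient: sharp direction x (curvature h), constant slope -g along y.
        grad = np.array([h * w[0], -g])
        # EMA momentum; beta = 0 recovers plain normalized steepest descent.
        m = beta * m + (1.0 - beta) * grad
        # Unit-norm update: only the direction of the buffer matters.
        w = w - eta * m / (np.linalg.norm(m) + 1e-12)
    return w

w_plain = run_nsd(beta=0.0)  # no momentum
w_mom = run_nsd(beta=0.9)    # with momentum

print("progress along y without momentum:", w_plain[1])
print("progress along y with momentum:   ", w_mom[1])
```

Without momentum, the x-gradient flips sign at every step and dominates the norm, so only a small share of each unit-norm step goes to y. With momentum, the alternating x-gradients largely cancel in the buffer while the constant y-gradient accumulates, so a larger share of each step goes to y. Under these assumed settings, the momentum run should therefore travel farther along y in the same number of steps.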
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 205