Towards Understanding Momentum Acceleration in River-Valley Loss Landscape

Published: 29 May 2026, Last Modified: 29 May 2026, Venue: HiLD at ICML 2026 Spotlight, License: CC BY 4.0
Keywords: In the river-valley loss landscape, momentum can accelerate optimization by enabling the use of a large learning rate.
Abstract: Momentum is a critical and ubiquitous component of modern optimizers, yet its role remains unclear beyond restricted settings, especially in the optimization of large-scale neural networks. Recent studies suggest that the highly non-convex loss landscape of large language models exhibits a “river-valley” structure: a low-loss manifold (the river) bordered by sharp, high-loss directions (the valley), where long-run optimization progress is determined primarily by progress along the river. Motivated by this structure, we investigate the role of heavy-ball momentum in this setting. Specifically, we analyze gradient descent with heavy-ball momentum and show that, compared to vanilla gradient descent, momentum can accelerate progress along the river by enabling the use of a substantially larger learning rate. Momentum acts as a stabilizer against the oscillations caused by an aggressive learning rate, which vanilla gradient descent cannot tolerate. We validate these insights with experiments on synthetic functions and language model training, offering practical guidance for tuning the learning rate and momentum parameters.
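The stabilizing effect described in the abstract can be illustrated with a minimal sketch. It is not the paper's experiment: the two-dimensional quadratic, the curvature constants `H_SHARP` and `H_FLAT`, the starting point, and the step counts are all assumptions chosen for illustration. The first coordinate stands in for a sharp valley direction and the second for the flat river direction. The update is the classical heavy-ball form, which may differ from the parameterization analyzed in the paper.

```python
import numpy as np

# Hypothetical curvatures: a sharp "valley" direction and a flat "river" direction.
H_SHARP, H_FLAT = 100.0, 0.01


def grad(w):
    """Gradient of the toy loss 0.5 * (H_SHARP * w[0]**2 + H_FLAT * w[1]**2)."""
    return np.array([H_SHARP * w[0], H_FLAT * w[1]])


def run(lr, beta, steps=200):
    """Heavy-ball iteration; beta = 0 reduces to vanilla gradient descent."""
    w = np.array([1.0, 1.0])  # assumed starting point
    v = np.zeros(2)
    for _ in range(steps):
        v = beta * v - lr * grad(w)  # velocity accumulates past gradients
        w = w + v
    return w


# On this quadratic, vanilla GD is stable only for lr < 2 / H_SHARP = 0.02,
# while heavy ball is stable for lr < 2 * (1 + beta) / H_SHARP.
gd_small = run(lr=0.019, beta=0.0)  # near the largest stable GD step
gd_large = run(lr=0.03, beta=0.0)   # diverges along the sharp direction
hb_large = run(lr=0.03, beta=0.9)   # same step, stabilized by momentum

for name, w in [("GD lr=0.019", gd_small),
                ("GD lr=0.03", gd_large),
                ("HB lr=0.03", hb_large)]:
    print(f"{name}: |sharp| = {abs(w[0]):.3e}, river coordinate = {w[1]:.4f}")
```

In this toy setting, the heavy-ball run stays bounded in the sharp direction at a step size where plain gradient descent blows up, and it ends closer to the minimum along the river coordinate than gradient descent at its near-maximal stable step. Note that in this parameterization momentum also enlarges the effective step along flat directions by roughly 1/(1 - beta), so the sketch does not separate that effect from the stability gain.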
Submission Number: 126