Keywords: Optimization, Momentum, Mini-batch
Abstract: During foundation model training, mini-batch stochastic gradient descent alleviates memory constraints; however, the resulting increase in gradient variance induces sharp oscillations in the loss curve, slowing convergence. Conventional momentum algorithms overlook this limitation of mini-batch training: they idealize momentum as propagating smoothly over time. In practice, however, momentum is effectively restricted to gradients within a single epoch, so cross-epoch information is severely diminished and cannot continuously suppress oscillations. For the first time, we theoretically analyze this momentum degradation problem under mini-batch gradients. To address it, we propose \textbf{Cascaded Momentum}, which splits momentum into an \textbf{Inner momentum} that rapidly smooths mini-batch gradients within each epoch and an \textbf{Outer momentum} that accumulates historical gradient trends across epochs to provide inertial guidance to subsequent epochs. This two-level mechanism simultaneously attenuates noise and accelerates convergence at virtually no additional cost.
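The abstract does not give the update equations, but the two-level mechanism it describes can be sketched as follows. This is a minimal illustrative implementation under assumed update rules (the decay coefficients `beta_in`/`beta_out`, the epoch-end fold-in of the inner buffer, and the re-seeding step are all assumptions for illustration, not the paper's specification):

```python
import numpy as np

class CascadedMomentum:
    """Illustrative two-level momentum sketch (assumed update rules).

    Inner momentum smooths noisy mini-batch gradients within an epoch;
    outer momentum accumulates the epoch-level trend across epochs and
    re-seeds the inner buffer so each new epoch starts with inertia.
    """

    def __init__(self, lr=0.01, beta_in=0.9, beta_out=0.99):
        self.lr = lr
        self.beta_in = beta_in    # fast, within-epoch smoothing
        self.beta_out = beta_out  # slow, cross-epoch trend
        self.m_in = None          # inner momentum buffer
        self.m_out = None         # outer momentum buffer

    def step(self, params, grad):
        """One mini-batch update using the inner momentum."""
        if self.m_in is None:
            self.m_in = np.zeros_like(grad)
            self.m_out = np.zeros_like(grad)
        # Inner momentum: exponential moving average of mini-batch gradients.
        self.m_in = self.beta_in * self.m_in + (1 - self.beta_in) * grad
        return params - self.lr * self.m_in

    def end_epoch(self):
        """Fold this epoch's smoothed direction into the cross-epoch trend,
        then re-seed the inner buffer with it for the next epoch."""
        if self.m_in is None:
            return
        self.m_out = self.beta_out * self.m_out + (1 - self.beta_out) * self.m_in
        self.m_in = self.m_out.copy()
```

The key design point in this sketch is that the outer buffer changes only once per epoch, so it retains cross-epoch gradient trends that a single fast-decaying momentum would forget between epochs.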
Primary Area: learning theory
Submission Number: 9614