Keywords: Optimization, Momentum, Mini-batch
Abstract: During foundation model training, mini-batch stochastic gradient descent alleviates memory constraints; however, the resulting increase in gradient variance induces sharp oscillations in the loss curve, slowing convergence. Conventional momentum algorithms overlook this limitation of mini-batch training: they idealize momentum as propagating smoothly over time. In practice, however, momentum is effectively restricted to gradients within a single epoch, so cross-epoch information is severely diminished and cannot continuously suppress oscillations. For the first time, we theoretically analyze this momentum degradation problem under mini-batch gradients. To address it, we propose \textbf{Cascaded Momentum}, which splits momentum into an \textbf{Inner momentum} that rapidly smooths mini-batch gradients within each epoch and an \textbf{Outer momentum} that accumulates historical gradient trends across epochs to provide inertial guidance to subsequent epochs. This two-level mechanism simultaneously attenuates noise and accelerates convergence at virtually no additional cost.
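The abstract does not give the update equations, but the two-level mechanism it describes can be sketched as follows. This is a minimal illustrative implementation under assumed update rules (the decay coefficients `beta_in`/`beta_out`, the epoch-end fold-in of the inner buffer, and the re-seeding step are all assumptions for illustration, not the paper's specification):

```python
import numpy as np

class CascadedMomentum:
    """Illustrative two-level momentum sketch (assumed update rules).

    Inner momentum smooths noisy mini-batch gradients within an epoch;
    outer momentum accumulates the epoch-level trend across epochs and
    re-seeds the inner buffer so each new epoch starts with inertia.
    """

    def __init__(self, lr=0.01, beta_in=0.9, beta_out=0.99):
        self.lr = lr
        self.beta_in = beta_in    # fast, within-epoch smoothing
        self.beta_out = beta_out  # slow, cross-epoch trend
        self.m_in = None          # inner momentum buffer
        self.m_out = None         # outer momentum buffer

    def step(self, params, grad):
        """One mini-batch update using the inner momentum."""
        if self.m_in is None:
            self.m_in = np.zeros_like(grad)
            self.m_out = np.zeros_like(grad)
        # Inner momentum: exponential moving average of mini-batch gradients.
        self.m_in = self.beta_in * self.m_in + (1 - self.beta_in) * grad
        return params - self.lr * self.m_in

    def end_epoch(self):
        """Fold this epoch's smoothed direction into the cross-epoch trend,
        then re-seed the inner buffer with it for the next epoch."""
        if self.m_in is None:
            return
        self.m_out = self.beta_out * self.m_out + (1 - self.beta_out) * self.m_in
        self.m_in = self.m_out.copy()
```

The key design point in this sketch is that the outer buffer changes only once per epoch, so it retains cross-epoch gradient trends that a single fast-decaying momentum would forget between epochs.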
Primary Area: learning theory
Submission Number: 9614