Abstract: We study the convergence of the shuffling gradient method, a popular algorithm for minimizing finite-sum functions with regularization, in which (Proximal) Gradient Descent (GD) is applied to the component functions one by one in an order determined by a permutation of their indices. Despite its easy implementation and effective performance in practice, its theoretical understanding remains limited. A recent advance by (Liu & Zhou, 2024b) establishes the first last-iterate convergence results under various settings, in particular proving the optimal rates for smooth (strongly) convex optimization. However, their bounds for nonsmooth (strongly) convex functions are only as fast as Proximal GD. In this work, we provide the first improved last-iterate analysis for the nonsmooth case, demonstrating that the widely used Random Reshuffle ($\textsf{RR}$) and Single Shuffle ($\textsf{SS}$) strategies are both provably faster than Proximal GD, reflecting the benefit of randomness. As an important implication, we give the first (nearly) optimal convergence result for the suffix average under the $\textsf{RR}$ sampling scheme in the general convex case, matching the lower bound shown by (Koren et al., 2022).
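To make the method concrete, the following is a minimal Python/NumPy sketch of an epoch-wise shuffling (proximal) gradient method with the $\textsf{RR}$ and $\textsf{SS}$ permutation strategies. The function name `shuffling_prox_gd`, the `grads`/`prox` interfaces, and the placement of the proximal step once per epoch are illustrative assumptions, not the exact update analyzed in the paper.

```python
import numpy as np

def shuffling_prox_gd(grads, prox, x0, lr, epochs, strategy="RR", seed=0):
    """Sketch of a shuffling (proximal) gradient method.

    grads    : list of per-component gradient functions, grads[i](x) = grad f_i(x)
    prox     : proximal operator of the regularizer, prox(x, step)
    x0       : initial point (NumPy array)
    lr       : step size for each inner component step
    strategy : "RR" draws a fresh permutation every epoch;
               "SS" draws one permutation and reuses it in all epochs.
    """
    rng = np.random.default_rng(seed)
    n = len(grads)
    x = x0.copy()
    perm = rng.permutation(n)            # fixed permutation reused by SS
    for _ in range(epochs):
        if strategy == "RR":             # Random Reshuffle: new permutation each epoch
            perm = rng.permutation(n)
        for i in perm:                   # pass over the components one by one
            x = x - lr * grads[i](x)     # gradient step on component i only
        x = prox(x, lr * n)              # proximal step w.r.t. the regularizer (assumed once per epoch)
    return x                             # last iterate
```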
Lay Summary: Shuffling gradient methods are widely used in practice but come with few theoretical convergence guarantees. In particular, for nonsmooth convex problems, whether the last iterate of shuffling gradient methods outperforms Proximal Gradient Descent (GD) has remained unclear.
This work addresses this question by proving that the last-iterate convergence rates of two popular shuffling strategies, Random Reshuffle ($\textsf{RR}$) and Single Shuffle ($\textsf{SS}$), are both faster than that of Proximal GD (conditionally for $\textsf{SS}$). Notably, our analysis builds on a more general framework that is not limited to shuffling gradient methods and yields a new sufficient condition for the last-iterate convergence of first-order methods of a general form.
These new results demonstrate the benefit of the randomness in $\textsf{RR}$ and $\textsf{SS}$, which indeed leads to faster convergence.
Primary Area: Optimization->Convex
Keywords: Convex optimization, Nonsmooth optimization, Shuffling methods
Submission Number: 14232