Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Shuffling-type gradient methods are favored in practice for their simplicity and rapid empirical performance. Although convergence guarantees have been developed extensively under various assumptions in recent years, most require the Lipschitz smoothness condition, which is often not met in common machine learning models. We highlight this issue with specific counterexamples. To address this gap, we revisit the convergence rates of shuffling-type gradient methods without assuming Lipschitz smoothness. With our stepsize strategy, the shuffling-type gradient algorithm not only converges under weaker assumptions but also matches the current best-known convergence rates, thereby broadening its applicability. We prove convergence rates for the nonconvex, strongly convex, and non-strongly convex cases, each under both random reshuffling and arbitrary shuffling schemes, and under a general bounded variance condition. Numerical experiments further validate the performance of our shuffling-type gradient algorithm, underscoring its practical efficacy.
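For readers unfamiliar with the algorithm class, the sketch below illustrates a generic shuffling-type gradient method under random reshuffling: each epoch draws a fresh permutation of the n component functions and performs one incremental update per sample. The diminishing stepsize schedule shown is purely illustrative and is not the paper's specific stepsize formula; grad_i, eta0, and the 2/3 decay exponent are placeholder choices for this sketch (see the linked repository for the authors' implementation).

    # Minimal sketch of a shuffling-type gradient method (random reshuffling).
    # The stepsize schedule is an illustrative assumption, not the paper's formula.
    import numpy as np

    def shuffling_gradient(grad_i, w0, n, epochs, eta0=0.1):
        """grad_i(w, i): gradient of the i-th component function at w."""
        w = np.asarray(w0, dtype=float)
        for t in range(1, epochs + 1):
            perm = np.random.permutation(n)       # random reshuffling each epoch
            eta = eta0 / t ** (2.0 / 3.0)         # illustrative diminishing stepsize
            for i in perm:
                w = w - (eta / n) * grad_i(w, i)  # one incremental update per sample
        return w

Replacing np.random.permutation(n) with a fixed or adversarially chosen ordering gives the arbitrary-shuffling scheme also analyzed in the paper.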
Lay Summary: Many AI engineers speed up learning by shuffling the order of training examples, yet proofs of its reliability have existed only for the ideal case where the loss surface is perfectly smooth. Modern models—from image classifiers to language models—violate that tidy assumption, belonging instead to a broader “generalized smoothness” family whose bumps and plateaus were not covered by earlier theory. Our study shows that shuffling‑type gradient methods still converge rapidly under this more permissive generalized‑smoothness condition. We supply tight rate guarantees for both easy (convex) and hard (non‑convex) objectives and present a simple formula for picking step sizes that keeps training on track in these rougher landscapes. By extending the safety net around a technique practitioners already trust, our results let developers use shuffling with mathematical confidence on today’s messier problems and point to step‑size schedules that can make training even faster in practice.
Link To Code: https://github.com/heqi0511/Revisiting-Convergence-Shuffling-Complexity-Beyond-Lipschitz-Smoothness.git
Primary Area: Optimization
Keywords: shuffling-type gradient methods, convergence analysis, relaxed smoothness assumptions
Submission Number: 11028