The Convergence Rate of SGD's Final Iterate: Analysis on Dimension Dependence

Published: 01 Feb 2023, Last Modified: 13 Feb 2023. Submitted to ICLR 2023.
Abstract: Stochastic Gradient Descent (SGD) is among the simplest and most popular optimization and machine learning methods. Running SGD with a fixed step size and outputting the final iterate is an ideal strategy one could hope for, yet its convergence is still not well understood even though SGD has been studied extensively for over 70 years. Given the $\Theta(\log T)$ gap between the current upper and lower bounds for running SGD for $T$ steps, [Koren and Segal, COLT 2020] asked how to characterize the final-iterate convergence of SGD with a fixed step size in the constant-dimension setting, i.e., $d=O(1)$. In this paper, we consider the more general setting of any $d\leq T$, proving $\Omega(\log d/\sqrt{T})$ lower bounds on the sub-optimality of the final iterate of SGD for minimizing non-smooth Lipschitz convex functions with standard step sizes. Our results provide the first general dimension-dependent lower bound on the convergence of SGD's final iterate, partially resolving the open question raised by [Koren and Segal, COLT 2020]. Moreover, we present a new one-dimensional analysis based on martingale arguments and Freedman's inequality, which achieves the tight $O(1/\sqrt{T})$ upper bound under mild assumptions and recovers the previous best $O(\log T/\sqrt{T})$ bound under the standard assumptions.
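To make the setting studied in the abstract concrete, below is a minimal sketch of the final-iterate strategy: run SGD with the fixed step size $\eta = 1/\sqrt{T}$ on a non-smooth Lipschitz convex objective and output the last iterate rather than an average. The one-dimensional objective $f(x)=|x|$, the Gaussian noise model, and the function name `sgd_final_vs_average` are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def sgd_final_vs_average(T=10_000, noise_std=1.0, seed=0):
    # Sketch only: fixed-step SGD on f(x) = |x| with noisy subgradients
    # (an assumed toy setup, not the paper's lower-bound instance).
    rng = np.random.default_rng(seed)
    eta = 1.0 / np.sqrt(T)           # standard fixed step size
    x = 1.0                          # initial point
    iterates = []
    for _ in range(T):
        g = np.sign(x)               # subgradient of |x|
        g_noisy = g + noise_std * rng.standard_normal()
        x = x - eta * g_noisy        # fixed-step SGD update
        iterates.append(x)
    x_final = iterates[-1]           # final iterate (the object of study)
    x_avg = np.mean(iterates)        # averaged iterate, for comparison
    # Sub-optimality f(x) - f(x*), with minimizer x* = 0.
    return abs(x_final), abs(x_avg)

if __name__ == "__main__":
    final_gap, avg_gap = sgd_final_vs_average()
    print(f"final-iterate sub-optimality:    {final_gap:.4f}")
    print(f"averaged-iterate sub-optimality: {avg_gap:.4f}")
```

Running the sketch for several seeds gives a rough sense of the gap between the final and averaged iterates; the paper's question is how this gap behaves in theory as a function of $T$ and the dimension $d$.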
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Optimization (eg, convex and non-convex optimization)