Derandomized Online-to-Non-convex Conversion for Stochastic Weakly Convex Optimization

ICLR 2026 Conference Submission 14919 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Non-smooth optimization, Non-convex optimization, Stochastic gradient descent with momentum, Online learning, Neural networks
TL;DR: A derandomized O2NC method for stochastic weakly convex optimization with optimal complexity and competitive numerical performance
Abstract: Online-to-non-convex conversion (O2NC) is an online-learning-based framework for producing Goldstein $(\delta,\epsilon)$-stationary points of non-smooth non-convex functions with optimal oracle complexity $\mathcal{O}(\delta^{-1} \epsilon^{-3})$. Subject to auxiliary \emph{random interpolation or scaling}, O2NC recovers the stochastic gradient descent with momentum (SGDM) algorithm widely used for training neural networks. This randomization, however, introduces deviations from practical SGDM, so a natural question arises: can we derandomize O2NC to achieve the same optimal guarantees while still resembling SGDM? On the negative side, the general answer is \emph{no}, owing to the impossibility results of~\citet{jordan23deterministic}, which show that no dimension-free rate can be achieved by deterministic algorithms. On the positive side, as the primary contribution of this work, we show that O2NC can be naturally derandomized for \emph{weakly convex} functions. Remarkably, our deterministic algorithm converges at the optimal rate as long as the weak convexity parameter is at most $\mathcal{O}(\delta^{-1}\epsilon^{-1})$. In other words, the stronger the stationarity we demand, the more non-convexity our optimizer can tolerate. Additionally, we develop a periodically restarted variant of our method that allows for more aggressive updates when the iterates are far from stationarity. The resulting algorithm, which corresponds to a momentum-restarted version of SGDM, is empirically effective and efficient for training ResNet and ViT networks.
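For context, the notion of Goldstein $(\delta,\epsilon)$-stationarity referenced above is standard in non-smooth non-convex optimization; the usual definition is sketched below (notation may differ slightly from the paper's).

```latex
% Standard definition of Goldstein (delta, epsilon)-stationarity (stated for context,
% not quoted from the submission): x is Goldstein (\delta,\epsilon)-stationary for f if
\mathrm{dist}\bigl(0,\ \partial_{\delta} f(x)\bigr) \le \epsilon,
\qquad \text{where} \quad
\partial_{\delta} f(x) := \mathrm{conv}\!\Bigl(\bigcup_{\|y-x\|\le\delta} \partial f(y)\Bigr),
% i.e., some convex combination of subgradients taken at points within distance \delta of x
% has norm at most \epsilon.
```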
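The abstract describes the restarted variant as a momentum-restarted version of SGDM. As a hedged illustration only, and not the authors' exact algorithm, hyperparameters, or notation, a minimal PyTorch-style sketch of SGDM with periodic momentum restarts could look like:

```python
import torch

def sgdm_with_momentum_restarts(params, loss_fn, data_loader,
                                lr=0.1, beta=0.9, restart_every=100):
    """Illustrative sketch: SGD with momentum whose momentum buffer is periodically
    reset to zero. All names (lr, beta, restart_every) are hypothetical, not the paper's."""
    momenta = [torch.zeros_like(p) for p in params]
    for step, (x, y) in enumerate(data_loader):
        if step % restart_every == 0:
            for m in momenta:               # momentum restart: forget accumulated history
                m.zero_()
        loss = loss_fn(x, y)
        grads = torch.autograd.grad(loss, params)
        with torch.no_grad():
            for p, m, g in zip(params, momenta, grads):
                m.mul_(beta).add_(g, alpha=1.0 - beta)  # exponential moving average of gradients
                p.add_(m, alpha=-lr)                    # descent step along the momentum direction
```

This only conveys the structural idea of discarding accumulated momentum at the start of each period; the actual restart schedule and step sizes in the paper would follow its theoretical analysis.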
Primary Area: optimization
Submission Number: 14919