Adaptive Gradient Normalization and Independent Sampling for (Stochastic) Generalized-Smooth Optimization

Yufeng Yang; Erin E. Tripp; Yifan Sun; Shaofeng Zou; Yi Zhou

Published: 28 Jul 2025, Last Modified: 28 Jul 2025. Accepted by TMLR. License: CC BY 4.0

Abstract: Recent studies have shown that many nonconvex machine learning problems satisfy a generalized-smooth condition that extends beyond traditional smooth nonconvex optimization. However, existing algorithms are not fully adapted to such generalized-smooth nonconvex geometry and face significant technical limitations in their convergence analysis. In this work, we first analyze the convergence of adaptively normalized gradient descent under function geometries characterized by generalized-smoothness and the generalized PL condition, revealing the advantage of adaptive gradient normalization. Our results provide theoretical insights into adaptive normalization across various scenarios. For stochastic generalized-smooth nonconvex optimization, we propose the Independent-Adaptively Normalized Stochastic Gradient Descent (IAN-SGD) algorithm, which leverages adaptive gradient normalization, independent sampling, and gradient clipping to achieve an $\mathcal{O}(\epsilon^{-4})$ sample complexity under relaxed noise assumptions. Experiments on large-scale nonconvex generalized-smooth problems demonstrate the fast convergence of our algorithm.
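The abstract names three ingredients: adaptive gradient normalization, independent sampling, and gradient clipping. The sketch below is one illustrative reading of how they might fit together, not the authors' implementation. The exact form of both update rules, the parameter names `lr`, `beta`, and `clip`, and the quartic test objective are all assumptions made for this example; the paper and the linked repository define the actual algorithms and step-size conditions.

```python
import math
import random


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def an_gd_step(x, grad, lr=0.1, beta=0.5, eps=1e-12):
    """One adaptively normalized step (assumed form):
    x <- x - lr * grad / ||grad||**beta, with beta in [0, 1].
    beta = 0 gives plain gradient descent, beta = 1 full normalization.
    """
    scale = lr / (norm(grad) ** beta + eps)  # eps guards against a zero gradient
    return [xi - scale * gi for xi, gi in zip(x, grad)]


def ian_sgd_step(x, stoch_grad, rng, lr=0.1, beta=0.5, clip=1.0):
    """One step with independent sampling and clipping (assumed form).

    The direction uses one stochastic gradient; the normalization scale
    uses a second, independently drawn stochastic gradient. The scale is
    floored at `clip`, so small gradients are left unnormalized.
    """
    g_dir = stoch_grad(x, rng)    # sample used for the update direction
    g_scale = stoch_grad(x, rng)  # independent sample used only for scaling
    denom = max(clip, norm(g_scale)) ** beta
    return [xi - lr * gi / denom for xi, gi in zip(x, g_dir)]


# Hypothetical test problem: f(x) = sum(x_i^4) / 4, whose curvature grows
# with ||x||, so no single global smoothness constant applies.
def quartic(x):
    return sum(xi ** 4 for xi in x) / 4.0


def quartic_grad(x):
    return [xi ** 3 for xi in x]


def noisy_grad(x, rng, sigma=0.1):
    """Exact gradient plus Gaussian noise drawn from the supplied rng."""
    return [xi ** 3 + rng.gauss(0.0, sigma) for xi in x]
```

Drawing the scaling sample separately from the direction sample keeps the two statistically independent given the current iterate, which is the property the "independent sampling" name points to. Passing a seeded `random.Random` instance as `rng` makes a run reproducible.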

Submission Length: Long submission (more than 12 pages of main content)

Changes Since Last Submission: In this revision, we have updated the manuscript in response to the comments from reviewer IJCU. For the first point raised, we added a brief comparison between our noise assumption and those used in previous works, and clarified the motivation for adopting this assumption (see the paragraph below Assumption 4). We also added Remark 3, which summarizes the key insight into how a large batch size of $\Omega(\epsilon^{-2})$ can improve the convergence rate of Clipped SGD under the expected noise setting. The same reasoning applies to IAN-SGD, since it also incorporates gradient clipping. For the second point, we revised the relevant descriptions in both the related work and main sections to address the reviewer’s concern. Finally, we carefully polished the manuscript by correcting notation inconsistencies and grammatical issues. We also added an acknowledgements section to recognize the reviewer’s constructive feedback and the external funding that supported this work, and changed the page header from MM/2025 to 07/2025.

Code: https://github.com/ynyang94/Gensmooth-IAN-SGD

Supplementary Material: zip

Assigned Action Editor: ~Sebastian_U_Stich1

Submission Number: 4210
