Adaptive Gradient Normalization and Independent Sampling for (Stochastic) Generalized-Smooth Optimization

TMLR Paper4210 Authors

14 Feb 2025 (modified: 21 Apr 2025) · Under review for TMLR · CC BY 4.0
Abstract: Recent studies have shown that many nonconvex machine learning problems satisfy a generalized-smooth condition that extends beyond traditional smooth nonconvex optimization. However, existing algorithms are not fully adapted to such generalized-smooth nonconvex geometry and encounter significant technical limitations in their convergence analysis. In this work, we first analyze the convergence of adaptively normalized gradient descent under function geometries characterized by generalized smoothness and a generalized PL condition, revealing the advantage of adaptive gradient normalization. Our results provide theoretical insights into adaptive normalization across various scenarios. For stochastic generalized-smooth nonconvex optimization, we propose Independent-Adaptively Normalized Stochastic Gradient Descent (IAN-SGD), which leverages adaptive gradient normalization, independent sampling, and gradient clipping to achieve an $\mathcal{O}(\epsilon^{-4})$ sample complexity under relaxed noise assumptions. Experiments on large-scale nonconvex generalized-smooth problems demonstrate the fast convergence of our algorithm.
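To make the abstract's ingredients concrete, below is a minimal, hypothetical sketch of a single update step that combines the three components named above: independent sampling (two independent mini-batches, one for the update direction and one for the normalization factor), gradient clipping, and adaptive gradient normalization. The function names, hyper-parameters (`lr`, `clip`, `delta`), and the exact update form are assumptions for illustration only and are not taken from the paper; in the literature, generalized smoothness is often formalized as $\|\nabla f(x)-\nabla f(y)\|\le (L_0+L_1\|\nabla f(x)\|)\,\|x-y\|$, though the paper's precise condition may differ.

```python
# Hypothetical sketch of one IAN-SGD-style step; the names and the update
# rule are illustrative assumptions, not the paper's exact algorithm.
import numpy as np


def ian_sgd_step(x, grad_fn, sample_batch, lr=0.1, clip=1.0, delta=1e-2):
    """One step combining independent sampling, clipping, and adaptive normalization.

    grad_fn(x, batch) returns a stochastic gradient estimate at x on `batch`.
    Using two independent mini-batches decouples the normalization factor
    from the update direction.
    """
    batch_a = sample_batch()        # first independent mini-batch
    batch_b = sample_batch()        # second independent mini-batch
    g_dir = grad_fn(x, batch_a)     # gradient used as the update direction
    g_scale = grad_fn(x, batch_b)   # gradient used only for normalization

    # Clip the normalization gradient to guard against heavy-tailed noise.
    norm = np.linalg.norm(g_scale)
    if norm > clip:
        g_scale = g_scale * (clip / norm)

    # Adaptive normalization: the effective step size shrinks when the
    # (clipped) gradient is large.
    step = lr / (np.linalg.norm(g_scale) + delta)
    return x - step * g_dir
```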
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: In this revision we first correct the proof errors in Theorem 1 (case III) and Lemma 2. Theorem 1 is restated so that its convergence rate is expressed in terms of $T$, matching the notation of Theorem 2. After establishing a new Lemma 2, we improve Theorem 2 by removing the batch-size dependence on $\tau_{1}$. We also add the key descent inequality for IAN-SGD under the generalized-smooth condition and polish every proof in the appendix to enhance clarity and correctness. In response to Reviewer IJCU, we add further ablation studies. By varying $\tau_{1}$ and $\tau_{2}$, the first ablation study reveals their combined influence on $\delta$, convergence speed, and stability. The second ablation study further compares IAN-SGD with other baselines using a small batch size (see Appendices A.3.1 and A.4). In the main text, we revise the related work section, which now states the precise assumptions made in prior studies. After Theorems 1 and 2, we discuss in detail how our results differ from previous work. At the end of the experiments section, we summarize the main experimental findings on hyper-parameter settings. We also revise sentences that reviewers identified as ambiguous and correct typos.
Assigned Action Editor: ~Sebastian_U_Stich1
Submission Number: 4210