Tight High-Probability Bounds for Nonconvex Heavy-Tailed Scenario under Weaker Assumptions

Published: 18 Sept 2025, Last Modified: 29 Oct 2025, NeurIPS 2025 poster, CC BY 4.0
Keywords: weaker assumptions, heavy-tailed noise, (L_0, L_1)-smoothness, optimization, generalization
Abstract: Gradient clipping is increasingly important in centralized learning (CL) and federated learning (FL). Many works focus on its optimization properties under strong assumptions involving Gaussian noise and standard smoothness. However, practical machine learning tasks often only satisfy weaker conditions, such as heavy-tailed noise and $(L_0, L_1)$-smoothness. To bridge this gap, we propose a high-probability analysis for clipped Stochastic Gradient Descent (SGD) under these weaker assumptions. Our findings show that a better convergence rate than existing ones can be achieved, and our high-probability analysis does not rely on the bounded gradient assumption. Moreover, we extend our analysis to FL, where a gap remains between expected and high-probability convergence, which naive clipped SGD cannot bridge. Thus, we design a new \underline{Fed}erated \underline{C}lipped \underline{B}atched \underline{G}radient (FedCBG) algorithm, and prove high-probability convergence and generalization bounds for the first time. Our analysis reveals the trade-offs between optimization and generalization performance. Extensive experiments demonstrate that FedCBG generalizes better to unseen client distributions than state-of-the-art baselines.
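Below is a minimal sketch of the clipped SGD update analyzed in the abstract, assuming the standard clipping operator $\min(1, \lambda/\|g\|)\,g$ and a heavy-tailed (Student-t) noise model for illustration; the exact batching and aggregation used by FedCBG are not specified here, so the `grad_fn`, step size, and clipping threshold names are hypothetical.

```python
import numpy as np

def clip(g, lam):
    """Standard clipping operator: min(1, lam / ||g||) * g."""
    norm = np.linalg.norm(g)
    return g if norm <= lam else (lam / norm) * g

def clipped_sgd(grad_fn, x0, steps=1000, lr=0.01, lam=1.0, rng=None):
    """Clipped SGD: x_{t+1} = x_t - lr * clip(stochastic gradient, lam)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(steps):
        g = grad_fn(x, rng)        # stochastic, possibly heavy-tailed, gradient
        x -= lr * clip(g, lam)
    return x

if __name__ == "__main__":
    # Toy quadratic objective with Student-t gradient noise (df < 3, so
    # higher moments are unbounded -- a simple heavy-tailed setting).
    grad = lambda x, rng: x + rng.standard_t(df=2.5, size=x.shape)
    print(clipped_sgd(grad, x0=np.full(10, 5.0)))
```

The clipping threshold `lam` caps the influence of any single heavy-tailed gradient sample, which is what enables high-probability guarantees without a bounded-gradient assumption.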
Supplementary Material: zip
Primary Area: Optimization (e.g., convex and non-convex, stochastic, robust)
Submission Number: 16064