Generalization of noisy SGD in unbounded non-convex settings

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: We study the generalization of iterative noisy gradient schemes on smooth non-convex losses. Formally, we establish time-independent information-theoretic generalization bounds for Stochastic Gradient Langevin Dynamics (SGLD) that do not diverge as the iteration count increases. Our bounds are obtained through a stability argument: we analyze the difference between two SGLD sequences run in parallel on two datasets sampled from the same distribution. Our result only requires an isoperimetric inequality to hold, which is merely a restriction on the tails of the loss. Our work relaxes the assumptions of prior work and establishes that the iterates stay within a bounded KL divergence of each other. Under an additional dissipativity assumption, we show that the stronger Rényi divergence also stays bounded, by establishing a uniform log-Sobolev constant for the iterates. Without dissipativity, we sidestep the need for local log-Sobolev inequalities and instead exploit the regularizing properties of Gaussian convolution. These techniques allow us to show that strong convexity is not necessary for finite stability bounds. Our work shows that noisy SGD can have finite, iteration-independent generalization and differential privacy bounds in unbounded non-convex settings.
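For readers unfamiliar with the setup, a minimal sketch of the recursion and the stability quantity being controlled is given below. The parameterization (step size $\eta$, inverse temperature $\beta$, empirical loss $\widehat{L}_S$ on dataset $S$) is a standard SGLD convention assumed here for illustration and may differ from the paper's exact scaling.

% Assumed SGLD parameterization (illustrative; not necessarily the paper's exact scaling):
% step size eta > 0, inverse temperature beta > 0, empirical loss \widehat{L}_S on dataset S.
\[
  w_{k+1} \;=\; w_k \;-\; \eta\,\nabla \widehat{L}_S(w_k) \;+\; \sqrt{\tfrac{2\eta}{\beta}}\;\xi_k,
  \qquad \xi_k \sim \mathcal{N}(0, I_d).
\]
% Stability argument (as described in the abstract): run the same recursion on a second
% dataset S' drawn from the same distribution to obtain iterates w'_k, and bound the
% divergence between the laws of the two sequences uniformly in k, e.g.
\[
  \sup_{k \ge 0}\; \mathrm{KL}\!\left(\mathrm{Law}(w_k)\,\middle\|\,\mathrm{Law}(w'_k)\right) \;<\; \infty .
\]

A time-independent bound on this divergence is what yields generalization and differential privacy guarantees that do not grow with the iteration count.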
Lay Summary: Machine learning models are trained by updating the model weights a very large number of times. In principle, taking this many steps increases the chances of learning spurious patterns; in practice, however, models trained for thousands of iterations still appear to perform well. In this work, we show that this is not only observed empirically but can also be established theoretically. We do so by modeling a training step as a noisy weight update, and we show that the presence of noise ensures that only a limited amount of spurious patterns is picked up. Our result holds for a simplified model of the noise in the weight update and improves on previous results in this space.
Primary Area: Theory->Optimization
Keywords: Information theoretic generalization, Langevin, SGD, differential privacy
Submission Number: 16148