Generalization as Dynamical Robustness – The Role of Riemannian Contraction in Supervised Learning

Published: 24 Apr 2023, Last Modified: 24 Apr 2023. Accepted by TMLR.
Abstract: A key property of successful learning algorithms is generalization. In classical supervised learning, generalization can be achieved by ensuring that the empirical error converges to the expected error as the number of training samples goes to infinity. Within this classical setting, we analyze the generalization properties of iterative optimizers such as stochastic gradient descent and natural gradient flow through the lens of dynamical systems and control theory. Specifically, we use contraction analysis to show that generalization and dynamical robustness are intimately related through the notion of algorithmic stability. In particular, we prove that Riemannian contraction in a supervised learning setting implies generalization. We show that if a learning algorithm is contracting in some Riemannian metric with rate $\lambda > 0$, it is uniformly algorithmically stable with rate $\mathcal{O}(1/(\lambda n))$, where $n$ is the number of examples in the training set. The results hold for stochastic and deterministic optimization, in both continuous and discrete time, and for convex and non-convex loss surfaces. The associated generalization bounds reduce to well-known results in the particular case of gradient descent over convex or strongly convex loss surfaces. They can be shown to be optimal in certain linear settings, such as kernel ridge regression under gradient flow. Finally, we demonstrate that the well-known Polyak-Łojasiewicz condition is closely tied to the contraction of a model's outputs as they evolve under gradient descent. This correspondence allows us to derive uniform algorithmic stability bounds for nonlinear function classes such as wide neural networks.
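Below is a minimal numerical sketch, not taken from the paper, illustrating the abstract's central claim: a learner that contracts with rate $\lambda$ should be uniformly algorithmically stable at rate $\mathcal{O}(1/(\lambda n))$. Ridge regression with regularization strength $\lambda$ has a $\lambda$-strongly convex loss, so full-batch gradient descent contracts in the Euclidean metric; all function names, hyperparameters, and data here are illustrative assumptions.

```python
# Illustrative sketch (assumed setup, not the authors' code): compare gradient
# descent on two "neighboring" datasets differing in a single example, and check
# that the resulting parameter distance shrinks roughly like 1/(lambda * n),
# consistent with uniform algorithmic stability under contraction.
import numpy as np

rng = np.random.default_rng(0)

def train(X, y, lam=1.0, lr=0.1, steps=500):
    """Full-batch gradient descent on the lam-strongly-convex ridge objective
    (1/(2n)) * ||X w - y||^2 + (lam/2) * ||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n + lam * w
        w -= lr * grad
    return w

d = 10
for n in [50, 100, 200, 400]:
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
    # Neighboring dataset: replace exactly one training example.
    X2, y2 = X.copy(), y.copy()
    X2[0] = rng.normal(size=d)
    y2[0] = X2[0] @ rng.normal(size=d)
    w1, w2 = train(X, y), train(X2, y2)
    # Stability proxy: distance between the two learned parameter vectors.
    print(f"n={n:4d}  ||w - w'|| = {np.linalg.norm(w1 - w2):.4f}")
```

Under these assumptions, the printed distances decay roughly inversely in $n$ for fixed $\lambda$, in line with the $\mathcal{O}(1/(\lambda n))$ stability rate stated above.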
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Joan_Bruna1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 563