Implicit Regularisation in Overparametrized Networks: A Multiscale Analysis of the Fokker-Planck equation

24 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: overparametrized networks, optimisation, implicit regularization, multiscale, fokker-planck equation
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: A derivation of the implicit regularisation drift in overparametrized networks via multiscale methods.
Abstract: In over-parametrised networks, a large continuous set of solutions (an invariant manifold) exists where the empirical loss is minimal. However, noise in the learning dynamics can introduce a drift along this manifold, biasing the dynamics towards solutions with higher ``smoothness'' and therefore acting as a regularizer. Li et al. (2022) presented a derivation of this drift, borrowing results from Katzenberger (1991), which shows that in the small learning-rate limit, $\eta \to 0$, the learning dynamics can be approximated by a stochastic differential equation (SDE) whose solution exhibits two distinct phases: a first phase, occurring over $O(\eta^{-1})$ steps, in which the parameters are deterministically driven towards the invariant manifold; and a second phase, over timescales $O(\eta^{-2})$, in which noise induces a deterministic drift along the invariant manifold. This latter drift can be regarded as the result of averaging the dynamics over the $O(\eta^{1/2})$ fluctuations orthogonal to the manifold, described by an Ornstein--Uhlenbeck process, as first suggested by Blanc et al. (2020). We offer a new derivation of the results of Li et al. (2022) that builds on the intuitive arguments of Blanc et al. (2020) by averaging the Fokker-Planck equation associated with the $\eta \to 0$ dynamics over this Ornstein--Uhlenbeck quasi-stationary state. Our contribution demonstrates the application of multiscale methods for elliptic partial differential equations (PDEs) (Pavliotis and Stuart, 2008) to optimization problems in machine learning.
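The two phases described in the abstract can be observed in a very small numerical experiment. The sketch below is not taken from the submission: the toy model $f(u,v) = uv$ fitted to a single label $y = 1$, the Rademacher label noise, the function name `simulate`, and all parameter values are assumptions chosen for illustration. Here the zero-loss manifold is the hyperbola $uv = 1$, and for this particular toy the SGD update contracts $u^2 - v^2$ by the exact factor $1 - \eta^2 r^2$ per step, where $r$ is the noisy residual, so the motion along the manifold takes on the order of $\eta^{-2}$ steps while the approach to the manifold takes on the order of $\eta^{-1}$ steps.

```python
import numpy as np

def simulate(eta=0.05, sigma=0.5, steps=20_000, u0=2.0, v0=1.0, seed=0):
    """Label-noise SGD on the toy loss 0.5 * (u*v - y)^2 with clean label y = 1.

    Hypothetical illustration only: the model, noise, and defaults are assumptions.
    Returns an array of shape (steps + 1, 2) holding the (u, v) trajectory.
    """
    rng = np.random.default_rng(seed)
    eps = rng.choice([-1.0, 1.0], size=steps)  # Rademacher label noise (assumed)
    traj = np.empty((steps + 1, 2))
    u, v = u0, v0
    traj[0] = (u, v)
    for t in range(steps):
        r = u * v - (1.0 + sigma * eps[t])       # residual against the noisy label
        u, v = u - eta * r * v, v - eta * r * u  # simultaneous gradient step
        traj[t + 1] = (u, v)
    return traj

traj = simulate()
u, v = traj[:, 0], traj[:, 1]
imbalance = np.abs(u**2 - v**2)  # measures position along the manifold u*v = 1

# Early steps: u*v relaxes to about 1 while the imbalance barely changes.
# Late steps: u*v keeps fluctuating around 1 while the imbalance decays.
for t in (0, 100, 5_000, 20_000):
    print(f"step {t:>6}: u*v = {u[t] * v[t]:.3f}, |u^2 - v^2| = {imbalance[t]:.2e}")
```

In this toy the slow drift moves the parameters towards the balanced point $u = v = 1$, which lowers $u^2 + v^2$, the trace of the loss Hessian on the manifold. Setting `sigma=0.0` removes the label noise, and the parameters then stay essentially where the first phase leaves them, which suggests that the drift along the manifold is driven by the noise.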
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9200