Implicit Regularization of AdaDelta

Published: 08 Dec 2024. Last Modified: 08 Dec 2024. Accepted by TMLR. License: CC BY 4.0
Abstract: We consider the AdaDelta adaptive optimization algorithm on locally Lipschitz, positively homogeneous, and o-minimally definable neural networks, with either the exponential or the logistic loss. We prove that, after achieving perfect training accuracy, the resulting adaptive gradient flows converge in direction to a Karush-Kuhn-Tucker point of the margin maximization problem, i.e., they perform the same implicit regularization as the plain gradient flows. We also prove that the loss decreases to zero and the Euclidean norm of the parameters increases to infinity at the same rates as for the plain gradient flows. Moreover, we consider generalizations of AdaDelta in which the exponential decay coefficients may vary with time and the numerical stability terms may differ across parameters, and we obtain the same results provided the former do not approach 1 too quickly and the latter have isotropic quotients. Finally, we corroborate our theoretical results with numerical experiments on convolutional networks trained on the MNIST and CIFAR-10 datasets.
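For readers unfamiliar with the algorithm the abstract refers to, the following is a minimal sketch of the standard AdaDelta update rule (Zeiler, 2012), annotated with the two generalizations described in the abstract: a decay coefficient that may vary with time and numerical stability terms that may differ across parameters. The names `adadelta_step`, `rho`, and `eps` are illustrative and are not taken from the paper or its code repository.

```python
# A minimal sketch of the AdaDelta update rule (Zeiler, 2012), not the paper's code.
# The abstract's generalizations are indicated in comments: rho may vary with time,
# and eps may be a per-parameter array rather than a scalar.
import numpy as np

def adadelta_step(x, grad, state, rho=0.95, eps=1e-6):
    """One AdaDelta update. `state` holds running averages of squared gradients
    (Eg2) and squared updates (Edx2). `eps` may be a scalar or an array shaped
    like x (per-parameter numerical stability terms)."""
    Eg2, Edx2 = state
    Eg2 = rho * Eg2 + (1.0 - rho) * grad**2                  # accumulate squared gradients
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad    # rescaled step
    Edx2 = rho * Edx2 + (1.0 - rho) * dx**2                  # accumulate squared updates
    return x + dx, (Eg2, Edx2)

# Illustrative usage on a toy exponential-loss problem (not from the paper):
# minimize exp(-y * w.z) for a single example (z, y).
z, y = np.array([1.0, 2.0]), 1.0
w = np.zeros(2)
state = (np.zeros_like(w), np.zeros_like(w))
for t in range(1000):
    grad = -y * np.exp(-y * (w @ z)) * z   # gradient of the exponential loss
    rho_t = 0.95                           # could vary with t, per the abstract's generalization
    w, state = adadelta_step(w, grad, state, rho=rho_t)
```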
Submission Length: Regular submission (no more than 12 pages of main content)
Code: https://github.com/englert-m/adadelta-implicit-regularization
Supplementary Material: zip
Assigned Action Editor: ~Robert_M._Gower1
Submission Number: 3120