Abstract: We consider the AdaDelta adaptive optimization algorithm on locally Lipschitz, positively homogeneous, and o-minimally definable neural networks, with either the exponential or the logistic loss. We prove that, after achieving perfect training accuracy, the resulting adaptive gradient flows converge in direction to a Karush-Kuhn-Tucker point of the margin maximization problem, i.e., they perform the same implicit regularization as the plain gradient flows. We also prove that the loss decreases to zero and the Euclidean norm of the parameters increases to infinity at the same rates as for the plain gradient flows. Moreover, we consider generalizations of AdaDelta where the exponential decay coefficients may vary with time and the numerical stability terms may differ across the parameters, and we obtain the same results provided the former do not approach 1 too quickly and the latter have isotropic quotients. Finally, we corroborate our theoretical results with numerical experiments on convolutional networks on the MNIST and CIFAR-10 datasets.
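For readers unfamiliar with AdaDelta, the following is a minimal sketch of the standard discrete-time update rule (after Zeiler, 2012), whose continuous-time analogue is the adaptive gradient flow studied in the abstract. The decay coefficient `rho` and the stability term `eps` are the quantities whose generalizations (time-varying decay, per-parameter stability terms) are mentioned above; all names here are illustrative and not taken from the submission.

```python
import numpy as np

def adadelta_step(x, grad, acc_grad_sq, acc_delta_sq, rho=0.95, eps=1e-6):
    """One discrete-time AdaDelta update (sketch); names are illustrative.

    x            : current parameters
    grad         : gradient of the loss at x
    acc_grad_sq  : running average of squared gradients, E[g^2]
    acc_delta_sq : running average of squared updates,   E[dx^2]
    """
    # Accumulate the squared gradient with exponential decay coefficient rho.
    acc_grad_sq = rho * acc_grad_sq + (1.0 - rho) * grad**2
    # Scale the gradient by RMS[dx] / RMS[g]; eps is the numerical stability term.
    delta = -np.sqrt(acc_delta_sq + eps) / np.sqrt(acc_grad_sq + eps) * grad
    # Accumulate the squared update with the same decay.
    acc_delta_sq = rho * acc_delta_sq + (1.0 - rho) * delta**2
    return x + delta, acc_grad_sq, acc_delta_sq
```

The generalizations considered in the paper would correspond, in this sketch, to letting `rho` depend on the iteration and replacing the scalar `eps` by a per-parameter vector.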
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We have revised the paper based on the reviewers' comments and questions, for which we are grateful. This involved expanding the Introduction, expanding our remarks on the Assumptions and the main Theorem, restructuring and expanding its proof, and expanding the Experiments section. We also performed more runs of the experiments, updated the plots, and provided the code in the supplementary materials. Please see our responses to each reviewer for further details.
Assigned Action Editor: ~Robert_M._Gower1
Submission Number: 3120