The Case for Full-Matrix Adaptive Regularization

Naman Agarwal; Brian Bullins; Xinyi Chen; Elad Hazan; Karan Singh; Cyril Zhang; Yi Zhang

27 Sept 2018 (modified: 05 May 2023), ICLR 2019 Conference Blind Submission, Readers: Everyone

Abstract: Adaptive regularization methods pre-multiply a descent direction by a preconditioning matrix. Due to the large number of parameters in machine learning problems, full-matrix preconditioning methods are prohibitively expensive. We show how to modify full-matrix adaptive regularization in order to make it practical and effective. We also provide a novel theoretical analysis for adaptive regularization in non-convex optimization settings. The core of our algorithm, termed GGT, consists of the efficient computation of inverse square roots of low-rank matrices. Our preliminary experiments show an improved convergence rate for GGT across a variety of synthetic tasks and standard deep learning benchmarks.

Keywords: adaptive regularization, non-convex optimization

TL;DR: fast, truly scalable full-matrix AdaGrad/Adam, with theory for adaptive stochastic non-convex optimization
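
The abstract's central computational claim, that inverse square roots of low-rank matrices can be computed efficiently, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the preconditioner has the form [(G G^T)^{1/2} + eps*I]^{-1}, where G is a d x r matrix whose columns are recent gradients and r is much smaller than d. The function name `low_rank_inv_sqrt_apply` and the parameters `eps` and `tol` are hypothetical names chosen for illustration.

```python
import numpy as np


def low_rank_inv_sqrt_apply(G, g, eps=1e-4, tol=1e-10):
    """Return [(G G^T)^{1/2} + eps*I]^{-1} @ g without forming any d x d matrix.

    Assumed setup (illustrative, not taken from the source):
    G is d x r with r << d, g is a length-d vector, eps > 0.
    """
    # The r x r Gram matrix shares its nonzero eigenvalues with G G^T;
    # they are the squared singular values of G.
    gram = G.T @ G
    evals, V = np.linalg.eigh(gram)

    # Discard numerically zero directions to avoid dividing by zero below.
    thresh = tol * max(float(evals.max()), 0.0)
    keep = evals > thresh
    sigma = np.sqrt(evals[keep])

    # Left singular vectors of G: U = G V Sigma^{-1}, shape d x k, orthonormal columns.
    U = (G @ V[:, keep]) / sigma

    # Inside span(U) the operator scales by 1 / (sigma_i + eps);
    # on the orthogonal complement it scales by 1 / eps.
    coeffs = U.T @ g
    return g / eps + U @ (coeffs / (sigma + eps) - coeffs / eps)
```

Under these assumptions the dominant costs are forming the Gram matrix, O(d r^2), and the small eigendecomposition, O(r^3), compared with O(d^3) for a dense d x d inverse square root.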
