KrADagrad: Kronecker Approximation-Domination Gradient Preconditioned Stochastic Optimization

Published: 08 May 2023, Last Modified: 26 Jun 2023, UAI 2023
Keywords: Second Order Optimization, Adaptive Gradient, Kronecker, Preconditioning, Online Learning
TL;DR: Second order optimizer that avoids matrix inversion
Abstract: Second order stochastic optimizers allow the parameter update step size and direction to adapt to loss curvature, but have traditionally required too much memory and compute for deep learning. Recently, Shampoo [Gupta et al., 2018] introduced a Kronecker-factored preconditioner to reduce these requirements: it is used for large deep models [Anil et al., 2020] and in production [Anil et al., 2022]. However, Shampoo requires taking inverse matrix roots of ill-conditioned matrices, which demands 64-bit precision and thus imposes strong hardware constraints. In this paper, we propose a novel factorization, Kronecker Approximation-Domination (KrAD). Using KrAD, we update a matrix that directly approximates the inverse empirical Fisher matrix (as full matrix AdaGrad does), avoiding matrix inversion and hence the need for 64-bit precision. We then propose KrADagrad$^\star$, which has computational costs similar to Shampoo's and the same regret bound. Experiments on synthetic ill-conditioned problems show improved performance over Shampoo at 32-bit precision, while on several real datasets we obtain comparable or better generalization.
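For intuition about the inversion-free idea, the following is a minimal, illustrative Python/NumPy sketch, not the paper's KrAD or KrADagrad$^\star$ update (which is Kronecker-factored; see the paper for the actual algorithm). It maintains an approximation of $(\epsilon I + \sum_t g_t g_t^\top)^{-1}$ directly via rank-1 Sherman-Morrison updates, so no explicit matrix inversion or inverse matrix root is ever computed. The dimension, step size, toy quadratic loss, and the use of a plain inverse rather than AdaGrad's inverse square root are all simplifying assumptions made here for illustration only.

```python
# Illustrative sketch only (NOT the paper's KrAD algorithm): maintain an
# approximation of (eps*I + sum_t g_t g_t^T)^{-1} with rank-1 Sherman-Morrison
# updates, so no explicit matrix inversion or inverse matrix root is taken.
import numpy as np

d, lr, eps = 10, 0.1, 1e-3          # assumed toy dimension and hyperparameters
rng = np.random.default_rng(0)
A = rng.standard_normal((d, d))
A = A @ A.T + np.eye(d)             # PSD matrix defining a toy quadratic loss 0.5 * x^T A x
x = rng.standard_normal(d)

H_inv = np.eye(d) / eps             # approximates (eps*I)^{-1} at initialization

for _ in range(200):
    g = A @ x                       # gradient of the toy quadratic loss
    # Sherman-Morrison: (M + g g^T)^{-1} = M^{-1} - M^{-1} g g^T M^{-1} / (1 + g^T M^{-1} g)
    Hg = H_inv @ g
    H_inv -= np.outer(Hg, Hg) / (1.0 + g @ Hg)
    # Preconditioned step; full-matrix AdaGrad would instead use an inverse *square root*.
    x -= lr * (H_inv @ g)

print("final loss:", 0.5 * x @ A @ x)
```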
Supplementary Material: pdf
Other Supplementary Material: zip