TL;DR: Kronecker-factored preconditioning is gaining popularity as an alternative to Adam/SGD. We provide concrete evidence of its usefulness by analyzing how layer-wise preconditioners uniquely enhance feature learning.
Abstract: Layer-wise preconditioning methods are a family of memory-efficient optimization algorithms that introduce preconditioners per axis of each layer's weight tensors. These methods have seen a recent resurgence, demonstrating impressive performance relative to entry-wise ("diagonal") preconditioning methods such as Adam(W) on a wide range of neural network optimization tasks. Complementary to their practical performance, we demonstrate that layer-wise preconditioning methods are provably necessary from a statistical perspective. To showcase this, we consider two prototypical models, *linear representation learning* and *single-index learning*, which are widely used to study how typical algorithms efficiently learn useful *features* to enable generalization.
In these problems, we show that SGD is a suboptimal feature learner once we move beyond the ideal isotropic inputs $\mathbf{x} \sim \mathsf{N}(\mathbf{0}, \mathbf{I})$ and well-conditioned settings typically assumed in prior work.
We demonstrate theoretically and numerically that this suboptimality is fundamental, and that layer-wise preconditioning emerges naturally as the solution. We further show that standard tools like Adam preconditioning and batch-norm only mildly mitigate these issues, supporting the unique benefits of layer-wise preconditioning.
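To make "layer-wise" concrete, consider a single weight matrix $\mathbf{W}$ with gradient $\mathbf{G}$. A Kronecker-factored (layer-wise) method applies one preconditioner per axis of the weight tensor; as a minimal sketch, with symmetric positive-definite factors $\mathbf{L}$ and $\mathbf{R}$ (whose specific choice depends on the method, e.g. KFAC or Shampoo, and is not spelled out here), the update reads
$$
\mathbf{W} \;\leftarrow\; \mathbf{W} - \eta\, \mathbf{L}^{-1} \mathbf{G}\, \mathbf{R}^{-1},
\qquad\text{equivalently}\qquad
\operatorname{vec}(\mathbf{W}) \;\leftarrow\; \operatorname{vec}(\mathbf{W}) - \eta\, (\mathbf{R} \otimes \mathbf{L})^{-1} \operatorname{vec}(\mathbf{G}),
$$
whereas entry-wise ("diagonal") methods such as Adam rescale each entry of $\mathbf{G}$ individually.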
Lay Summary: Recently, a new family of optimization algorithms has shown great promise in making neural network training faster and more efficient in practice. These algorithms introduce new forms of "preconditioning", the practice of "re-sizing" a problem in the hope of making it easier to find good solutions. The current standard optimizer, Adam, preconditions each parameter of a neural network independently, while these new algorithms also take into account dependencies between parameters within each layer of the network, hence "layer-wise preconditioning".
On the other hand, theory researchers have proposed various problems to understand very clearly how neural networks find good solutions. These works typically study the most basic optimization algorithm, stochastic gradient descent (SGD). However, we found that SGD is fundamentally limited: when the data is not perfectly "well-conditioned" (imagine some coordinates of the data being much larger than others), these positive results about neural network training no longer hold, in theory or in practice.
In adjusting SGD to work for general types of data, we found that the resulting algorithm aligns with these practical "layer-wise preconditioning" algorithms. This has implications both for theorists, to whom these results offer a concrete path toward analyzing larger families of neural network optimization algorithms, and for practitioners, to whom they provide a strong mathematical motivation for why these new algorithms work.
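The contrast can be illustrated with a toy sketch (not the paper's exact algorithm or experiments): a single linear layer trained on anisotropic inputs under squared loss, where the layer-wise preconditioner is taken, purely for illustration, to be the empirical input covariance and the learning rate is set to 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: one linear layer W (d_out x d_in), squared loss, and
# anisotropic inputs x ~ N(0, Sigma) with an ill-conditioned Sigma.
d_in, d_out, n = 8, 4, 4096
Sigma_sqrt = np.diag(np.linspace(0.1, 3.0, d_in))
X = rng.standard_normal((n, d_in)) @ Sigma_sqrt
W_star = rng.standard_normal((d_out, d_in))
Y = X @ W_star.T

W = np.zeros((d_out, d_in))
G = (W @ X.T - Y.T) @ X / n          # gradient of 0.5 * mean ||W x - y||^2

# Plain SGD step: move along the raw gradient.
W_sgd = W - 1.0 * G

# Entry-wise ("diagonal") preconditioning, an Adam-flavoured stand-in:
# rescale each entry of G by a per-entry magnitude statistic.
W_diag = W - 1.0 * G / (np.abs(G) + 1e-8)

# Layer-wise preconditioning: a matrix acting on the input axis of the layer,
# here the empirical input covariance, which "whitens" the anisotropic x's.
R = X.T @ X / n + 1e-8 * np.eye(d_in)
W_layer = W - 1.0 * G @ np.linalg.inv(R)

for name, W_hat in [("SGD", W_sgd), ("diagonal", W_diag), ("layer-wise", W_layer)]:
    print(f"{name:10s} distance to W*: {np.linalg.norm(W_hat - W_star):.3f}")
```

In this sketch, a single layer-wise preconditioned step essentially recovers the target weights, while the raw and entry-wise rescaled gradient steps remain biased by the input anisotropy.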
Primary Area: Theory->Optimization
Keywords: feature learning, representation learning, preconditioning, single-index models, two-layer networks, non-convex optimization, matrix sensing, quasi-Newton methods, Kronecker-factored approximate curvature, Shampoo
Submission Number: 9288