Keywords: Adafactor, stochastic optimization, non-convex smooth optimization, convergence rate
Abstract: Adafactor is an early memory-efficient optimization algorithm proposed as an alternative to Adam. By eliminating first-order momentum and employing a
rank-$1$ matrix factorization to approximate the second-moment matrix, Adafactor achieves near-zero memory overhead compared to traditional gradient descent methods.
Despite its practical suitability for large-scale training tasks where memory efficiency is critical, its convergence has remained theoretically unexplored, largely due to the challenges posed by its matrix factorization and update-clipping mechanisms. In this work, we provide a convergence analysis of Adafactor for non-convex smooth optimization.
We establish optimal convergence rates (up to logarithmic factors) for finding stationary points in both deterministic and stochastic settings, the latter under sub-Gaussian noise.
Central to our analysis is the interpretation of Adafactor as an approximation of Adam, together with a new proxy step-size that approximates the unique
adaptive step-size induced by Adafactor's matrix factorization and update clipping, and an induction argument to control the gradient magnitude.
Our findings suggest that, at least in theory, incorporating a rank-$1$ approximation of the second-moment matrix into Adam does not fundamentally hinder convergence.
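A minimal sketch, in our own notation, of the factored second-moment estimate and update clipping that the abstract refers to, following the standard Adafactor recursion for a matrix parameter $W_t \in \mathbb{R}^{n \times m}$ with gradient $G_t$, decay $\hat\beta_{2t}$, step-size $\alpha_t$, and clipping threshold $d$ (small regularization constants omitted); the symbols $R_t$, $C_t$, $\hat V_t$, $U_t$ are illustrative and not taken from the submission:
\begin{align*}
R_t &= \hat\beta_{2t}\, R_{t-1} + (1-\hat\beta_{2t})\,(G_t \odot G_t)\,\mathbf{1}_m && \text{(row sums of squared gradients)}\\
C_t &= \hat\beta_{2t}\, C_{t-1} + (1-\hat\beta_{2t})\,\mathbf{1}_n^{\top}(G_t \odot G_t) && \text{(column sums of squared gradients)}\\
\hat V_t &= \frac{R_t C_t}{\mathbf{1}_n^{\top} R_t} && \text{(rank-$1$ approximation of the second-moment matrix)}\\
U_t &= G_t / \sqrt{\hat V_t}, \qquad \hat U_t = \frac{U_t}{\max\{1,\ \mathrm{RMS}(U_t)/d\}} && \text{(element-wise division; update clipping)}\\
W_t &= W_{t-1} - \alpha_t\, \hat U_t.
\end{align*}
Only the row vector $R_t \in \mathbb{R}^{n}$ and column vector $C_t \in \mathbb{R}^{m}$ are stored in place of a full $n \times m$ second-moment matrix, which is the source of the near-zero memory overhead, and the ratio $\mathrm{RMS}(U_t)/d$ implements the update clipping whose analysis the abstract highlights.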
Primary Area: Optimization (e.g., convex and non-convex, stochastic, robust)
Submission Number: 12434