Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization
Keywords: Optimization, Stochastic Optimization, Nonconvex, Quasi-Newton, Neural Network, Deep Learning
Abstract: In this paper, we introduce Apollo, a quasi-Newton method for nonconvex stochastic optimization, which dynamically incorporates the curvature of the loss function by approximating the Hessian with a diagonal matrix. Algorithmically, Apollo requires only first-order gradients and updates the diagonal Hessian approximation so that it satisfies the weak secant relation. To handle nonconvexity, we replace the Hessian with its absolute value, which is also efficient to compute under our diagonal approximation, yielding an optimization algorithm with linear time and memory complexity. Experimentally, on three tasks spanning vision and language, we show that Apollo achieves significant improvements over other stochastic optimization methods, including SGD and variants of Adam, in terms of both convergence speed and generalization performance.
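The abstract describes two mechanisms: a diagonal Hessian approximation updated to satisfy the weak secant relation (d^T B d = d^T y), and a rectification that replaces the Hessian with its absolute value to cope with nonconvexity. The PyTorch sketch below illustrates one way these pieces can fit together for a single 1-D parameter tensor. It is an illustrative reading of the abstract, not the paper's exact algorithm: the function name, state layout, and the rectification constant `sigma` are assumptions, and details such as bias-corrected gradient moments and stepsize correction are omitted.

```python
import torch

def apollo_like_step(param, grad, state, lr=0.01, sigma=1.0):
    """One step of a diagonal quasi-Newton update (illustrative sketch).

    param: 1-D parameter tensor, updated in place.
    grad:  gradient of the loss at `param` (same shape).
    state: dict holding the previous displacement `d`, the previous
           gradient, and the diagonal Hessian approximation `B`
           (all zeros initially).
    """
    d = state["d"]                      # previous parameter displacement
    y = grad - state["prev_grad"]       # change in the gradient

    # Weak secant relation: require d^T B d = d^T y instead of B d = y.
    # The minimal diagonal correction enforcing it scales with d^2.
    B = state["B"]
    denom = d.pow(4).sum().clamp_min(1e-12)
    alpha = (d.dot(y) - (B * d * d).sum()) / denom
    B = B + alpha * d * d

    # Nonconvexity handling: use |B| (trivial for a diagonal matrix)
    # and bound it below so the preconditioner stays positive definite.
    D = B.abs().clamp_min(sigma)

    step = -lr * grad / D               # preconditioned gradient step
    param.add_(step)

    state.update(B=B, d=step, prev_grad=grad.clone())


# Usage on a toy quadratic. Every operation is elementwise or a
# reduction, so each step costs O(n) time and memory in the number
# of parameters, consistent with the abstract's linear-complexity claim.
p = torch.randn(10)
st = {"B": torch.zeros(10), "d": torch.zeros(10), "prev_grad": torch.zeros(10)}
for _ in range(5):
    g = 2 * (p - 1.0)                   # gradient of ||p - 1||^2
    apollo_like_step(p, g, st)
```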
One-sentence Summary: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=bRBY_4YuAK