Apollo: An Adaptive Parameter-wised Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization

Xuezhe Ma

28 Sept 2020 (modified: 05 May 2023) | ICLR 2021 Conference Blind Submission | Readers: Everyone

Keywords: Optimization, Stochastic Optimization, Nonconvex, Quasi-Newton, Neural Network, Deep Learning

Abstract: In this paper, we introduce Apollo, a quasi-Newton method for nonconvex stochastic optimization, which dynamically incorporates the curvature of the loss function by approximating the Hessian via a diagonal matrix. Algorithmically, Apollo requires only first-order gradients and updates the diagonal approximation of the Hessian such that it satisfies the weak secant relation. To handle nonconvexity, we replace the Hessian with its absolute value, the computation of which is also efficient under our diagonal approximation, yielding an optimization algorithm with linear complexity in both time and memory. Experimentally, on three tasks in vision and language, we show that Apollo achieves significant improvements over other stochastic optimization methods, including SGD and variants of Adam, in terms of both convergence speed and generalization performance.
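
To make the ingredients named in the abstract concrete, here is a minimal sketch in plain Python. It is not the paper's exact update rule: it assumes the textbook minimal-change diagonal update that enforces the weak secant relation s^T B s = s^T y, where s is the parameter difference and y is the gradient difference between two iterates. The function names, the `lr` parameter, and the `eps` floor on |B| are hypothetical choices made for illustration only.

```python
def weak_secant_diag_update(b, s, y):
    """Update diagonal Hessian estimate b so that sum(b_new * s^2) == sum(s * y).

    b: current diagonal (list of floats)
    s: parameter difference theta_new - theta_old
    y: gradient difference grad_new - grad_old
    Assumed rule: b_new = b + ((s.y - s.B.s) / sum(s^4)) * s^2
    """
    s_y = sum(si * yi for si, yi in zip(s, y))
    s_b_s = sum(bi * si * si for bi, si in zip(b, s))
    s4 = sum(si ** 4 for si in s)
    if s4 == 0.0:
        # No movement between iterates: nothing to learn, keep b unchanged.
        return list(b)
    scale = (s_y - s_b_s) / s4
    return [bi + scale * si * si for bi, si in zip(b, s)]


def abs_diag_newton_step(theta, grad, b, lr=1.0, eps=1e-8):
    """Take a step using |b| in place of b, so negative curvature
    cannot flip the descent direction. eps (an assumption) avoids
    division by zero when an entry of b is close to 0."""
    return [t - lr * g / max(abs(bi), eps) for t, g, bi in zip(theta, grad, b)]


if __name__ == "__main__":
    # Toy nonconvex quadratic f(theta) = 0.5 * sum(h_i * theta_i^2), h = [2, -3].
    h = [2.0, -3.0]
    s = [0.5, -1.0]
    y = [hi * si for hi, si in zip(h, s)]  # gradient difference for this f
    b_new = weak_secant_diag_update([1.0, 1.0], s, y)
    lhs = sum(bi * si * si for bi, si in zip(b_new, s))
    rhs = sum(si * yi for si, yi in zip(s, y))
    print("weak secant holds:", abs(lhs - rhs) < 1e-12)
```

Both functions touch each parameter a constant number of times and store one extra value per parameter, which is consistent with the linear time and memory cost stated in the abstract.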

One-sentence Summary: An Adaptive Parameter-wised Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Reviewed Version (pdf): https://openreview.net/references/pdf?id=bRBY_4YuAK
