TL;DR: We propose a preconditioner diagonalization framework that accelerates adaptive optimizers.
Abstract: Modern deep learning relies heavily on adaptive optimization methods such as Adam and its variants, which are valued for their robustness to model scale and their ease of hyperparameter tuning. However, the gradient statistics these methods maintain often do not capture enough gradient covariance information, which leads to suboptimal updates along certain directions of the parameter space and potentially slower convergence. In this work, we track such covariance statistics in the form of a structured preconditioner matrix. Unlike prior work, our approach does not approximate this matrix directly. We instead _implement an invertible transformation that maps the preconditioner matrix into a new space where it becomes approximately diagonal_. This enables a diagonal approximation of the preconditioner matrix in the transformed space, which offers several computational advantages. Empirical results show that our approach can substantially improve the convergence speed of modern adaptive optimizers. Notably, for large language models like LLaMA, we can achieve a $2\times$ speedup in sample efficiency compared to Adam. Our method can also be integrated with memory-efficient optimizers to manage computational overhead.
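The core idea in the abstract, rotating into a space where the preconditioner is approximately diagonal and applying a diagonal approximation there, can be illustrated with a small NumPy sketch. This is not the authors' algorithm: it assumes the invertible transformation is the orthogonal eigenbasis of a small, explicitly stored preconditioner, and the names `rotated_diagonal_step`, `cov`, and `grad` are hypothetical and chosen only for illustration.

```python
import numpy as np

def rotated_diagonal_step(grad, cov, eps=1e-8):
    """Precondition `grad` using a diagonal approximation taken in a rotated space.

    Assumption (illustrative only): the invertible transform is the orthogonal
    matrix of eigenvectors of the preconditioner `cov`.
    """
    # Columns of q are orthonormal eigenvectors, so q is invertible with inverse q.T.
    _, q = np.linalg.eigh(cov)
    # Map the preconditioner into the rotated space and keep only its diagonal.
    diag_rot = np.diag(q.T @ cov @ q)
    # Map the gradient into the same space and scale it coordinate-wise.
    grad_rot = q.T @ grad
    step_rot = grad_rot / (np.sqrt(np.maximum(diag_rot, 0.0)) + eps)
    # Map the scaled step back to the original parameter space.
    return q @ step_rot

rng = np.random.default_rng(0)
# Hypothetical 4x4 preconditioner with correlated coordinates (damped to stay well conditioned).
a = rng.normal(size=(4, 4))
cov = a @ a.T / 4 + 0.1 * np.eye(4)
grad = rng.normal(size=4)

# Largest off-diagonal magnitude before and after the rotation.
_, q = np.linalg.eigh(cov)
cov_rot = q.T @ cov @ q
off_diag_before = np.abs(cov - np.diag(np.diag(cov))).max()
off_diag_after = np.abs(cov_rot - np.diag(np.diag(cov_rot))).max()

step = rotated_diagonal_step(grad, cov)
```

In this toy setting the rotation diagonalizes `cov` exactly, so the diagonal step in the rotated space coincides (up to `eps`) with the full-matrix step `cov^{-1/2} grad`. In practice the transformation would only be approximate and would be refreshed periodically, which is where the "approximately diagonal" qualifier in the abstract matters.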
Code Dataset Promise: Yes
Signed Copyright Form: pdf
Format Confirmation: I agree that I have read and followed the formatting instructions for the camera ready version.
Submission Number: 1884