Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity

Published: 16 Jun 2024 · Last Modified: 16 Jun 2024 · HiLD at ICML 2024 Poster · CC BY 4.0
Keywords: Adam, coordinate-wise, adaptivity, $\ell_\infty$
TL;DR: We give a new convergence bound for Adam using coordinate-wise $\ell_\infty$ smoothness rather than the more common $\ell_2$ smoothness, which yields a much better empirical smoothness constant for GPT-2 models.
Abstract: Adam outperforms SGD when optimizing transformers for language modeling tasks. Yet this benefit is not well understood theoretically -- previous convergence analyses for Adam and SGD mainly focus on the dependence on the number of steps $T$ and are already minimax-optimal in the non-convex setting, with both achieving a rate of $O(T^{-1/4})$. In this work, we argue that the key reason Adam optimizes faster than SGD is a better dependence on the loss smoothness and the model dimension, which is typically much larger than the total number of steps for modern language modeling tasks. More specifically, we give a new convergence analysis for Adam under the novel assumption that the loss is smooth under $\ell_\infty$-geometry rather than the more common $\ell_2$-geometry, which yields a much better empirical smoothness constant for GPT-2 models. Moreover, we show that if we randomly rotate the pretraining loss, Adam can be outperformed by some variants of SGD that are invariant to rotations. This implies that any practically relevant explanation of Adam's optimization benefit must involve non-rotation-invariant properties of the loss, such as the $\ell_\infty$ smoothness used in our analysis.
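The central quantity in the abstract is the smoothness constant under $\ell_\infty$-geometry, i.e., Lipschitzness of the gradient when parameter perturbations are measured in the $\ell_\infty$ norm and gradient changes in the dual $\ell_1$ norm. As a rough illustration only -- not the paper's actual measurement procedure -- the sketch below estimates directional $\ell_2$ and $\ell_\infty$ smoothness constants by finite differences of gradients on a toy least-squares loss. PyTorch is assumed, and the toy loss, dimensions, and direction choice are placeholders.

```python
# Hypothetical sketch: directional smoothness estimates under two geometries.
#   l2:      ||grad(w + d) - grad(w)||_2 / ||d||_2
#   l_infty: ||grad(w + d) - grad(w)||_1 / ||d||_infty
# The toy least-squares loss stands in for an actual training loss.
import torch

torch.manual_seed(0)

def loss_fn(w, X, y):
    # Placeholder loss; in the paper's setting this would be a language-model loss.
    return ((X @ w - y) ** 2).mean()

def grad(w, X, y):
    w = w.clone().requires_grad_(True)
    loss_fn(w, X, y).backward()
    return w.grad.detach()

d_model = 256
X = torch.randn(1024, d_model)
y = torch.randn(1024)
w = torch.randn(d_model)

# Sign vector: a natural probe direction for l_infty geometry (unit l_infty norm).
eps = 1e-3
delta = eps * torch.sign(torch.randn(d_model))

diff = grad(w + delta, X, y) - grad(w, X, y)
L_l2 = (diff.norm(p=2) / delta.norm(p=2)).item()
L_linf = (diff.norm(p=1) / delta.norm(p=float("inf"))).item()
print(f"directional l2 estimate: {L_l2:.4f}, directional l_infty estimate: {L_linf:.4f}")
```

The raw constants are not directly comparable (the $\ell_\infty$-to-$\ell_1$ constant is larger by construction); the paper's point concerns how each constant enters the respective convergence bound. The rotation experiment mentioned in the abstract can likewise be thought of as optimizing $\tilde{L}(w) = L(Qw)$ for a random orthogonal $Q$, which preserves $\ell_2$-geometry but destroys the coordinate alignment that Adam's coordinate-wise adaptivity exploits.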
Student Paper: Yes
Submission Number: 76