AdaMeM: Memory Efficient Momentum for Adafactor

Published: 18 Jun 2024, Last Modified: 17 Jul 2024, WANT@ICML 2024 Oral, CC BY 4.0
Keywords: optimization, adam, adafactor, momentum, space, memory
TL;DR: We introduce a memory-efficient momentum scheme for Adafactor that closes a large fraction of the gap between Adafactor and Adam while outperforming low-rank optimizers such as GaLore.
Abstract: Adafactor is a memory-efficient algorithm that does not maintain momentum and has near-zero memory overhead compared to gradient descent. However, it performs worse than Adam in many setups. Prior work has shown that this gap can be closed by adding momentum to Adafactor, but this comes at the cost of increased memory requirements. In this work we draw on the ideas of low-rank optimizers such as LoRA and GaLore to maintain momentum on a low-rank subspace of the weights on top of Adafactor, yielding a new optimizer: AdaMeM. Unlike low-rank optimizers, however, we still utilize full-rank gradients and maintain momentum only on the top SVD subspace of the gradients. We show results on language modelling for models of size 210M and 550M, demonstrating improved performance over Adafactor and GaLore. We also give theoretical arguments supporting the design of AdaMeM.
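The sketch below illustrates the core idea described in the abstract: keep an Adafactor-style factored second moment at full rank, but maintain first-order momentum only in a rank-r subspace spanned by the top left singular vectors of the gradient. This is a minimal, hypothetical sketch, not the authors' implementation; names such as `rank`, `beta1`, and `refresh_every` are assumptions for illustration.

```python
# Minimal sketch (not the authors' implementation) of momentum restricted to
# the top SVD subspace of the gradient, combined with an Adafactor-style
# factored second moment kept at full rank.
import numpy as np

class AdaMeMSketch:
    def __init__(self, shape, rank=4, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, refresh_every=200):
        n, m = shape
        self.rank, self.lr = rank, lr
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.refresh_every = refresh_every
        # Adafactor-style factored second moment: row and column accumulators.
        self.row = np.zeros(n)
        self.col = np.zeros(m)
        # Momentum lives in the projected (rank x m) space.
        self.P = None                      # (n x rank) top-SVD projection
        self.m_low = np.zeros((rank, m))
        self.step = 0

    def update(self, w, g):
        self.step += 1
        # Periodically refresh the projection from the top-r left singular
        # vectors of the current gradient (similar in spirit to GaLore).
        if self.P is None or self.step % self.refresh_every == 1:
            U, _, _ = np.linalg.svd(g, full_matrices=False)
            self.P = U[:, :self.rank]
        # Momentum is accumulated only on the projected (low-rank) gradient.
        self.m_low = self.beta1 * self.m_low + (1 - self.beta1) * (self.P.T @ g)
        # Full-rank factored second moment, as in Adafactor.
        self.row = self.beta2 * self.row + (1 - self.beta2) * (g ** 2).mean(axis=1)
        self.col = self.beta2 * self.col + (1 - self.beta2) * (g ** 2).mean(axis=0)
        v = np.outer(self.row, self.col) / (self.row.mean() + self.eps)
        # Combine subspace momentum with the residual full-rank gradient.
        g_hat = self.P @ self.m_low + (g - self.P @ (self.P.T @ g))
        return w - self.lr * g_hat / (np.sqrt(v) + self.eps)

# Toy usage on a random quadratic objective.
rng = np.random.default_rng(0)
W = rng.normal(size=(32, 16))
opt = AdaMeMSketch(W.shape)
for _ in range(10):
    grad = W  # gradient of 0.5 * ||W||^2
    W = opt.update(W, grad)
```

Under these assumptions, the momentum buffer costs only rank x m entries instead of n x m, while the gradient itself is never projected away, which is the distinction from purely low-rank optimizers drawn in the abstract.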
Submission Number: 48