Keywords: Adafactor, stochastic optimization, non-convex smooth optimization, convergence
Abstract: Adafactor, a memory-efficient variant of Adam, has emerged as one of the popular choices for training deep learning models, particularly large language models.
However, despite its practical success, there is limited theoretical analysis of Adafactor's convergence. In this paper, we present a comprehensive analysis of Adafactor in the non-convex smooth setting. We show that full-batch Adafactor finds a stationary point at a rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$ with the default setup, which can be accelerated to $\tilde{\mathcal{O}}(1/T)$ with a constant step-size parameter. For stochastic Adafactor without update clipping, we prove a convergence rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$ under parameter choices that cover the default setup. We also prove that Adafactor with a time-varying clipping threshold finds a stationary point at a rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$. Our theoretical results are further supported by experiments.
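For readers unfamiliar with the algorithm analyzed above, the sketch below illustrates the standard Adafactor update for a matrix parameter, as introduced by Shazeer & Stern (2018): a factored (row/column) second-moment estimate, a rank-1 reconstruction used as the preconditioner, and update clipping at a threshold $d$. This is a minimal illustration only; the step-size schedule, hyperparameters, and the specific variants analyzed in the paper (e.g., with or without clipping, time-varying thresholds) may differ.

```python
import numpy as np

def adafactor_step(W, G, R, C, t, alpha=1e-2, d=1.0, eps=1e-30):
    """One simplified Adafactor step for a matrix parameter W of shape (n, m).

    Hedged sketch of the factored second-moment update; not the exact
    variant or parameterization analyzed in the paper.
    """
    beta2_t = 1.0 - t ** (-0.8)                          # decaying second-moment rate
    G2 = G * G + eps
    R = beta2_t * R + (1.0 - beta2_t) * G2.sum(axis=1)   # row statistics, shape (n,)
    C = beta2_t * C + (1.0 - beta2_t) * G2.sum(axis=0)   # column statistics, shape (m,)
    V = np.outer(R, C) / R.sum()                         # rank-1 estimate of E[G^2]
    U = G / np.sqrt(V)                                   # preconditioned update
    rms = np.sqrt(np.mean(U * U))
    U = U / max(1.0, rms / d)                            # update clipping at threshold d
    return W - alpha * U, R, C
```

The memory saving comes from storing only the row and column accumulators R and C (n + m entries) instead of a full n x m second-moment matrix as in Adam.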
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13777