Keywords: Adafactor, stochastic optimization, non-convex smooth optimization, convergence
Abstract: Adafactor, a memory-efficient variant of Adam, has emerged as one of the popular choices for training deep learning models, particularly large language models.
However, despite its practical success, there is limited theoretical analysis of Adafactor's convergence. In this paper, we present a comprehensive analysis of Adafactor in the non-convex smooth setting. We show that full-batch Adafactor finds a stationary point at a rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$ with the default setup, which can be accelerated to $\tilde{\mathcal{O}}(1/T)$ with a constant step-size parameter. For stochastic Adafactor without update clipping, we prove a convergence rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$ under parameter choices that cover the default setup. We also prove that Adafactor with a time-varying clipping threshold finds a stationary point at a rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$. Our theoretical results are further supported by experiments.
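For readers unfamiliar with the algorithm analyzed above, the sketch below illustrates the standard Adafactor update for a matrix parameter, as introduced by Shazeer & Stern (2018): a factored (row/column) second-moment estimate, a rank-1 reconstruction used as the preconditioner, and update clipping at a threshold $d$. This is a minimal illustration only; the step-size schedule, hyperparameters, and the specific variants analyzed in the paper (e.g., with or without clipping, time-varying thresholds) may differ.

```python
import numpy as np

def adafactor_step(W, G, R, C, t, alpha=1e-2, d=1.0, eps=1e-30):
    """One simplified Adafactor step for a matrix parameter W of shape (n, m).

    Hedged sketch of the factored second-moment update; not the exact
    variant or parameterization analyzed in the paper.
    """
    beta2_t = 1.0 - t ** (-0.8)                          # decaying second-moment rate
    G2 = G * G + eps
    R = beta2_t * R + (1.0 - beta2_t) * G2.sum(axis=1)   # row statistics, shape (n,)
    C = beta2_t * C + (1.0 - beta2_t) * G2.sum(axis=0)   # column statistics, shape (m,)
    V = np.outer(R, C) / R.sum()                         # rank-1 estimate of E[G^2]
    U = G / np.sqrt(V)                                   # preconditioned update
    rms = np.sqrt(np.mean(U * U))
    U = U / max(1.0, rms / d)                            # update clipping at threshold d
    return W - alpha * U, R, C
```

The memory saving comes from storing only the row and column accumulators R and C (n + m entries) instead of a full n x m second-moment matrix as in Adam.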
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13777