Convergence of Adafactor under Non-Convex Smooth Stochastic Optimization

15 May 2024 (modified: 06 Nov 2024) · Submitted to NeurIPS 2024 · CC BY 4.0
Keywords: Adafactor, stochastic optimization, convergence, non-convex smoothness
Abstract: As model sizes in deep learning continue to expand, memory-efficient optimizers are increasingly critical for managing the substantial memory demands of popular algorithms such as Adam and AdamW. Among these, Adafactor has emerged as one of the most widely adopted choices for training deep learning models, particularly large language models. However, despite its practical success, there is limited theoretical analysis of Adafactor's convergence. This paper presents a comprehensive analysis of Adafactor in the non-convex smooth setting, showing that it converges to a stationary point at a rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$. We find that the default hyper-parameter setting yields a sub-optimal rate in our framework, and we propose an alternative setting that theoretically attains the optimal convergence rate; this finding is further supported by experimental results. We also prove that Adafactor with a suitable time-varying clipping threshold converges, achieving empirical performance comparable to the standard constant-threshold setting.
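For readers unfamiliar with the algorithm the abstract refers to, below is a minimal NumPy sketch of one Adafactor update for a matrix parameter, following the original formulation (factored second-moment estimate, RMS update clipping with threshold $d$, and a relative step size). The function name, hyper-parameter names, and default values shown here are illustrative assumptions for exposition, not the settings analyzed or proposed in this paper; in particular, the constant threshold `d` is where a time-varying schedule $d_t$ would be substituted.

```python
import numpy as np

def rms(x):
    """Root-mean-square of a tensor."""
    return np.sqrt(np.mean(np.square(x)))

def adafactor_step(W, G, R, C, t, rho=0.01, beta2_decay=0.8,
                   eps1=1e-30, eps2=1e-3, d=1.0):
    """One (illustrative) Adafactor update for a matrix parameter W.

    W, G : parameter and its stochastic gradient, shape (n, m)
    R, C : running row / column second-moment accumulators, shapes (n,), (m,)
    d    : update-clipping threshold; a time-varying d_t could replace it
    """
    beta2t = 1.0 - t ** (-beta2_decay)                  # increasing decay rate
    G2 = np.square(G) + eps1
    R = beta2t * R + (1.0 - beta2t) * G2.sum(axis=1)    # row statistics
    C = beta2t * C + (1.0 - beta2t) * G2.sum(axis=0)    # column statistics
    V = np.outer(R, C) / R.sum()                        # rank-1 second-moment estimate
    U = G / np.sqrt(V)
    U = U / max(1.0, rms(U) / d)                        # RMS update clipping
    alpha = max(eps2, rms(W)) * rho                     # relative step size
    return W - alpha * U, R, C
```

The factored accumulators `R` and `C` store only $n + m$ values instead of the $n \times m$ second-moment matrix, which is the source of Adafactor's memory savings over Adam.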
Primary Area: Optimization (convex and non-convex, discrete, stochastic, robust)
Submission Number: 13557