Keywords: Adafactor, stochastic optimization, convergence, non-convex smoothness
Abstract: As model sizes in deep learning continue to grow, memory-efficient optimizers are increasingly critical for managing the substantial memory demands of popular algorithms such as Adam and AdamW. Among these, Adafactor has emerged as a widely adopted choice for deep learning training, particularly of large language models. Despite its practical success, however, there is little theoretical analysis of Adafactor's convergence. This paper presents a comprehensive analysis of Adafactor in the non-convex smooth setting, showing that it converges to a stationary point at a rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$. We find that the default hyperparameter setting yields a sub-optimal rate within our framework, and we propose an alternative setting that theoretically achieves the optimal convergence rate; this finding is further supported by experimental results.
We also prove that Adafactor with a suitable time-varying clipping threshold converges, achieving empirical performance comparable to that of the standard constant threshold.
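For orientation, the following is a minimal NumPy sketch of the standard Adafactor matrix update (Shazeer & Stern, 2018), the algorithm whose convergence the abstract analyzes. The schedules shown (relative step size rho_t, decay rate beta2_t, and clipping threshold d) follow the commonly cited defaults and are illustrative assumptions, not the alternative setting or time-varying threshold proposed by this paper.

```python
import numpy as np

def adafactor_matrix_step(X, G, R, C, t, d=1.0, eps1=1e-30, eps2=1e-3):
    """One Adafactor update for a matrix parameter X with gradient G.

    R, C are the factored row/column second-moment accumulators.
    d is the clipping threshold; the paper studies both a constant
    and a time-varying choice of this threshold.
    """
    rho_t = min(1e-2, 1.0 / np.sqrt(t))   # relative step size (default schedule)
    beta2_t = 1.0 - t ** (-0.8)           # increasing second-moment decay rate
    G2 = G ** 2 + eps1

    # Factored second-moment estimate: only row and column accumulators
    # are stored, so memory is O(n + m) instead of O(n * m).
    R = beta2_t * R + (1.0 - beta2_t) * G2.sum(axis=1)
    C = beta2_t * C + (1.0 - beta2_t) * G2.sum(axis=0)
    V_hat = np.outer(R, C) / R.sum()

    # Preconditioned update, clipped by its root-mean-square against d.
    U = G / np.sqrt(V_hat)
    rms_U = np.sqrt(np.mean(U ** 2))
    U_hat = U / max(1.0, rms_U / d)

    # Relative (parameter-scale) learning rate; no first-moment state.
    alpha_t = max(eps2, np.sqrt(np.mean(X ** 2))) * rho_t
    X = X - alpha_t * U_hat
    return X, R, C
```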
Primary Area: Optimization (convex and non-convex, discrete, stochastic, robust)
Submission Number: 13557