Addax: Memory-Efficient Fine-Tuning of Language Models with a Combination of Forward-Backward and Forward-Only Passes

Published: 05 Mar 2024, Last Modified: 12 May 2024
Venue: PML4LRS Poster
License: CC BY 4.0
Keywords: Large Language Models, Memory-Efficient Fine-Tuning, Zeroth-Order Optimization
Abstract: Fine-tuning language models (LMs) with first-order optimizers often demands excessive memory, limiting accessibility, while zeroth-order optimizers use far less memory but suffer from slow convergence that depends on model size. We introduce Addax, a novel method that integrates the recently proposed Memory-Efficient Zeroth-order Optimizer (MeZO) of Malladi et al. (2023) with Stochastic Gradient Descent (SGD). At each step, Addax obtains both zeroth-order and first-order gradient estimates and optimally combines them as the descent direction; the first-order updates are performed "in-place" to further save memory. Theoretically, we establish the convergence of Addax under mild assumptions, with less restrictive hyperparameter requirements and a convergence guarantee independent of model size. Our extensive experiments across diverse LMs and tasks show that Addax consistently outperforms zero-shot evaluation and MeZO in accuracy. Moreover, in specific scenarios Addax surpasses standard fine-tuning approaches such as SGD and Adam while requiring significantly less memory.
Submission Number: 5
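For intuition, below is a minimal PyTorch-style sketch of the kind of update the abstract describes: a MeZO-style (SPSA) zeroth-order estimate obtained from forward-only passes, combined with a first-order SGD gradient and applied in place. The `loss_fn(model, batch)` callable, the mixing weight `alpha`, and the fixed batch split are illustrative assumptions; the paper specifies how Addax actually assigns data between the two estimators and combines them.

```python
import torch


def spsa_grad_estimate(model, loss_fn, batch, eps=1e-3):
    """MeZO-style two-point (SPSA) zeroth-order gradient estimate.

    Perturbs the parameters in place using a stored random seed so the
    perturbation can be regenerated later instead of kept in memory.
    """
    seed = int(torch.randint(0, 2**31 - 1, (1,)).item())

    def perturb(scale):
        gen = torch.Generator().manual_seed(seed)
        with torch.no_grad():
            for p in model.parameters():
                z = torch.randn(p.shape, generator=gen).to(device=p.device, dtype=p.dtype)
                p.add_(scale * eps * z)

    perturb(+1.0)                                   # theta + eps * z
    loss_plus = float(loss_fn(model, batch))
    perturb(-2.0)                                   # theta - eps * z
    loss_minus = float(loss_fn(model, batch))
    perturb(+1.0)                                   # restore theta

    # Scalar projection of the gradient onto the random direction z.
    return seed, (loss_plus - loss_minus) / (2 * eps)


def combined_step(model, loss_fn, fo_batch, zo_batch,
                  lr=1e-5, alpha=0.5, eps=1e-3):
    """One step mixing a first-order gradient with a zeroth-order estimate.

    `alpha`, the batch split, and the simple convex combination below are
    illustrative assumptions, not the paper's exact rule.
    """
    # First-order gradient from a standard forward-backward pass.
    model.zero_grad()
    loss_fn(model, fo_batch).backward()

    # Zeroth-order estimate from two forward-only passes.
    seed, proj = spsa_grad_estimate(model, loss_fn, zo_batch, eps)

    gen = torch.Generator().manual_seed(seed)
    with torch.no_grad():
        for p in model.parameters():
            z = torch.randn(p.shape, generator=gen).to(device=p.device, dtype=p.dtype)
            direction = alpha * p.grad + (1.0 - alpha) * proj * z
            p.add_(-lr * direction)                 # in-place parameter update
```

The forward-only branch needs no activation storage for backpropagation, which is where the memory savings over a purely first-order optimizer come from in this sketch.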