Keywords: Zeroth-order optimizer, LLM, Fine-tuning
TL;DR: A memory-efficient zeroth-order optimizer for LLM fine-tuning that incorporates Adam-style moment estimates without storing them in memory.
Abstract: Fine-tuning LLMs is necessary for dedicated downstream uses, but classic backpropagation approaches require a large amount of GPU memory. To this end, a recent work, MeZO, which relies solely on forward passes to fine-tune LLMs, significantly reduces GPU requirements, at the cost of slower convergence due to its indifference to the loss landscape. Standard optimizers such as Adam adapt to the loss landscape by estimating the first and second moments of the gradient and keeping them in memory, guiding the model to move faster along dimensions with smaller curvature and slower along those with larger curvature. However, directly applying Adam negates MeZO's advantage, as it would triple the memory requirement. In light of this, we propose AdaMeZO, a zeroth-order optimizer enhanced by Adam-style first and second moment estimates, but without keeping them in memory. We present a theoretical analysis of AdaMeZO, corroborated by extensive experiments demonstrating that AdaMeZO can outperform MeZO while requiring up to $70\%$ fewer forward passes. Visualizations of optimization trajectories on toy functions further affirm AdaMeZO's ability to adapt to different loss landscapes.
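For context, below is a minimal PyTorch sketch of the MeZO-style zeroth-order step that the abstract builds on: a two-forward-pass SPSA gradient estimate in which the random perturbation is regenerated from a seed rather than stored. The names `loss_fn`, `model`, and `batch` are hypothetical placeholders, and the update shown is the plain MeZO/SGD rule; the abstract does not specify how AdaMeZO reconstructs the Adam-style moments without storing them, so that part is only indicated in a comment, not implemented.

```python
import torch


def seeded_perturb(params, seed, eps, scale):
    """In-place perturbation theta <- theta + scale * eps * z, where z ~ N(0, I)
    is regenerated from `seed` instead of being kept in memory."""
    gen = torch.Generator(device=params[0].device).manual_seed(seed)
    for p in params:
        z = torch.randn(p.shape, generator=gen, device=p.device, dtype=p.dtype)
        p.add_(z, alpha=scale * eps)


def mezo_style_step(model, loss_fn, batch, lr, eps=1e-3):
    """One MeZO-style zeroth-order step: two forward passes, no gradients or
    optimizer states stored. `loss_fn(model, batch)` is assumed to return a
    scalar loss tensor."""
    params = [p for p in model.parameters() if p.requires_grad]
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    with torch.no_grad():
        seeded_perturb(params, seed, eps, +1.0)   # theta + eps * z
        loss_plus = loss_fn(model, batch).item()
        seeded_perturb(params, seed, eps, -2.0)   # theta - eps * z
        loss_minus = loss_fn(model, batch).item()
        seeded_perturb(params, seed, eps, +1.0)   # restore theta

        # Projected gradient estimate: grad ≈ g_hat * z with
        # g_hat = (L(theta + eps*z) - L(theta - eps*z)) / (2 * eps).
        g_hat = (loss_plus - loss_minus) / (2 * eps)

        # Update: regenerate the same z from the seed and step along it.
        # MeZO uses this plain SGD-style rule (p <- p - lr * g_hat * z).
        # AdaMeZO, per the abstract, would additionally apply Adam-style
        # first/second-moment scaling here without materializing the moments;
        # that mechanism is not detailed in the abstract, so it is omitted.
        gen = torch.Generator(device=params[0].device).manual_seed(seed)
        for p in params:
            z = torch.randn(p.shape, generator=gen, device=p.device, dtype=p.dtype)
            p.add_(z, alpha=-lr * g_hat)

    return (loss_plus + loss_minus) / 2
```

Because the perturbation is reproducible from the integer seed, only two forward passes and a handful of scalars are needed per step, which is the memory property the abstract contrasts with storing full Adam moment tensors.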
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16126