Keywords: zeroth-order optimization, LLM fine-tuning
TL;DR: We propose Pseudo-Zeroth-Order (PseuZO), a framework that estimates Jacobian matrices via finite differences of model outputs and applies an EMA for variance reduction, with theoretical convergence guarantees and empirical validation.
Abstract: Zeroth-order optimization (ZO) has received wide attention in machine learning, especially when computing the full gradient is expensive or even impossible. Recently, ZO has emerged as an important paradigm for memory-efficient fine-tuning of large language models (LLMs), circumventing the memory overhead of backpropagation. However, existing ZO gradient estimators exhibit dimension-dependent variance scaling as $\Theta(d)$, leading to dimension-dependent convergence rates without further assumptions on the objective function, which is prohibitive for large-scale LLM parameters. To address this problem, we present a Pseudo-Zeroth-Order (PseuZO) framework for optimizing composite objective functions, especially large-scale models: $ \min_{\mathbf{x} \in \mathcal{X}} \mathcal{F}(\mathbf{x})= \mathbb{E}_{\mathbf{z}}\, g\circ h(\mathbf{x};\mathbf{z}) $, where $h$ represents complex, high-dimensional representations and $g$ is a task-specific loss. While existing zeroth-order methods estimate gradients from the final loss, our PseuZO algorithm estimates the Jacobian matrix of $h(\mathbf{x})$ using the model output $\mathbf{o}= h(\mathbf{x})$ and the gradient of the loss with respect to the model output, $\mathbf{e} = \nabla_{\mathbf{o}} g(\mathbf{o})$, and applies an exponential moving average to the Jacobian estimators to reduce variance. Moreover, we use a sliding window technique to reduce memory cost. Our algorithm achieves an $O( \max \lbrace \alpha_1 L\epsilon^{-2}, \alpha_1 L \sigma_2^2\epsilon^{-4} \rbrace )$ convergence rate, where $\alpha_1$ is the effective dimension of $\mathcal{F}$.
Experimental results demonstrate that PseuZO outperforms MeZO and MeZO-SVRG on classification, multiple-choice, and generation tasks in both full-parameter and PEFT fine-tuning settings by accelerating convergence in the early stages of training. For instance, under the same computation time on the SST-2 task, PseuZO achieves 8.8 percentage points higher accuracy than MeZO (91.2\% vs. 82.4\%). With the sliding window technique, PseuZO achieves a $70\%\sim80\%$ memory reduction compared to FO-SGD across model sizes, since it introduces only a small dimension-independent memory overhead, which enables efficient scaling of the model size. The code is available at https://github.com/YangBigMn/PseuZO.
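To make the estimator concrete, below is a minimal NumPy sketch of one PseuZO-style update, assuming a model `h` that maps a parameter vector to an output vector and a function `grad_g` that returns the loss gradient on that output. The function name `pseuzo_step`, the seed-based probe regeneration, and the hyperparameters (`mu`, `K`, `beta`, `lr`) are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def pseuzo_step(x, h, grad_g, window, mu=1e-3, K=8, beta=0.9, lr=1e-4, rng=None):
    """One illustrative PseuZO-style update (sketch, not the reference code).

    x       : parameter vector, shape (d,)
    h       : model mapping parameters -> output vector o in R^m
    grad_g  : function returning e = grad of the task loss w.r.t. the output o
    window  : list of (seed, delta) pairs kept from the last K steps
    """
    rng = rng or np.random.default_rng()
    seed = int(rng.integers(1 << 31))
    u = np.random.default_rng(seed).standard_normal(x.shape)   # random probe direction

    o = h(x)
    delta = (h(x + mu * u) - o) / mu          # finite-difference estimate of J u
    window.append((seed, delta))
    if len(window) > K:                        # sliding window: memory is O(K * m), not O(d)
        window.pop(0)

    e = grad_g(o)                              # error signal on the model output
    # EMA-weighted Jacobian estimate applied implicitly:
    #   J_bar ~ sum_k w_k * delta_k u_k^T, so J_bar^T e = sum_k w_k <delta_k, e> u_k
    weights = np.array([beta ** (len(window) - 1 - k) for k in range(len(window))])
    weights = weights / weights.sum()
    grad_est = np.zeros_like(x)
    for w_k, (seed_k, delta_k) in zip(weights, window):
        u_k = np.random.default_rng(seed_k).standard_normal(x.shape)  # regenerate probe from seed
        grad_est += w_k * float(delta_k @ e) * u_k

    return x - lr * grad_est, window
```

In this sketch the sliding window stores only the recent output differences and probe seeds (each probe is regenerated from its seed when needed), which is one plausible way to keep the extra memory overhead independent of the parameter dimension, as the abstract describes.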
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 10628