Keywords: Large Language Model, Multi-Task Learning, Zeroth-Order Fine-Tuning, Parameter-Efficient Fine-Tuning
Abstract: Fine-tuning pre-trained Large Language Models (LLMs) for downstream tasks with First-Order (FO) optimizers presents significant computational challenges. Parameter-Efficient Fine-Tuning (PEFT) methods address these challenges by freezing most model parameters and training only a small subset. While PEFT is efficient, it can fall short of full fine-tuning when high task-specific performance is required.
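To illustrate the PEFT setup described above, here is a minimal PyTorch-style sketch of freezing the backbone and training only a small adapter; the helper name `freeze_all_but_adapter` and the `"lora"` parameter-naming convention are illustrative assumptions, not part of the paper.

```python
import torch.nn as nn

def freeze_all_but_adapter(model: nn.Module, adapter_keyword: str = "lora"):
    """Hypothetical PEFT helper: freeze every backbone weight and leave
    only parameters whose name contains `adapter_keyword` trainable."""
    trainable = 0
    for name, param in model.named_parameters():
        param.requires_grad = adapter_keyword in name
        if param.requires_grad:
            trainable += param.numel()
    return trainable  # number of parameters that will actually be trained
```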
Zeroth-Order (ZO) methods offer an alternative: they fine-tune the entire pre-trained model by approximating gradients with forward passes alone, eliminating the computational burden of back-propagation. However, they converge slowly and are highly sensitive to the choice of task prompts.
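For concreteness, here is a minimal sketch of the kind of two-point (SPSA-style) gradient estimator ZO methods rely on, assuming a scalar-loss closure `loss_fn`; the name `zo_grad_estimate` and the smoothing scale `mu` are illustrative, not the paper's implementation.

```python
import torch

def zo_grad_estimate(loss_fn, params, mu=1e-3):
    """Two-point zeroth-order gradient estimate: two forward passes along
    a shared random direction z, with no back-propagation."""
    zs = [torch.randn_like(p) for p in params]
    with torch.no_grad():
        for p, z in zip(params, zs):
            p.add_(mu * z)                 # evaluate at theta + mu * z
        loss_plus = loss_fn()
        for p, z in zip(params, zs):
            p.sub_(2.0 * mu * z)           # evaluate at theta - mu * z
        loss_minus = loss_fn()
        for p, z in zip(params, zs):
            p.add_(mu * z)                 # restore theta (up to fp error)
        coeff = (loss_plus - loss_minus) / (2.0 * mu)
        return [coeff * z for z in zs]     # gradient estimate per tensor
```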
We bridge these two approaches with Bilevel-ZOFO, a penalty-based bilevel formulation that treats the adapter parameters as a lower-level learner coupled to an upper-level ZO optimizer of the full backbone. This double-loop strategy requires only gradients of the small PEFT module and forward passes of the base model. We provide theoretical convergence guarantees for Bilevel-ZOFO. Empirically, we demonstrate that Bilevel-ZOFO significantly outperforms existing ZO methods, trains 2–4$\times$ faster, and is less sensitive to prompts; it also outperforms FO PEFT methods while maintaining similar memory efficiency. Additionally, we show its strong potential for meta-learning.
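A minimal sketch of how such a double loop could look, reusing `zo_grad_estimate` from the sketch above; the function `bilevel_zofo_step`, its hyperparameters, and the omission of the penalty term coupling the two levels are simplifying assumptions, not the authors' released algorithm.

```python
import torch

def bilevel_zofo_step(adapter_params, backbone_params, loss_fn,
                      inner_steps=5, inner_lr=1e-4, outer_lr=1e-6, mu=1e-3):
    """One outer iteration of a hypothetical double loop.

    Lower level: ordinary first-order (backprop) steps on the small adapter.
    Upper level: a single zeroth-order step on the full backbone, so the
    backbone itself never needs a backward pass.
    """
    inner_opt = torch.optim.SGD(adapter_params, lr=inner_lr)
    for _ in range(inner_steps):           # lower level: FO on the adapter
        inner_opt.zero_grad()
        loss_fn().backward()
        inner_opt.step()

    grads = zo_grad_estimate(loss_fn, backbone_params, mu=mu)
    with torch.no_grad():                  # upper level: ZO on the backbone
        for p, g in zip(backbone_params, grads):
            p.sub_(outer_lr * g)
```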
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 5483