Zoop it! Efficient Zero-Order Optimization with Output Perturbation

Published: 10 Jun 2025 · Last Modified: 01 Jul 2025 · TTODLer-FM @ ICML 2025 Poster · CC BY 4.0
Keywords: Large Language Model, Zeroth-order Optimization, Natural Language Processing
Abstract: Zero-order gradient methods offer a promising alternative for fine-tuning large models under constrained computational resources by estimating the gradient without full backpropagation. Recent approaches employ an efficient finite-difference method that applies random noise perturbations to all model weights for gradient estimation. However, this approach suffers from high variance and slow convergence, since the number of parameters can be enormous. In a deep network composed of a cascade of layers, instead of perturbing all model weights with random noise, one can perturb only a layer's activation to estimate the gradient with respect to that activation, and then backpropagate from there to obtain a gradient estimate for the entire layer. This approach reduces variance as the batch size increases. Despite these potential advantages, an efficient implementation of this technique for large-model training remains unexplored. We introduce a new method, Zero-order Optimization with Output Perturbation (Zoop), which can be easily integrated into existing network architectures. Zoop enables selective perturbation control to balance memory efficiency and gradient accuracy. We evaluated Zoop on autoregressive and masked language models with up to 66 billion parameters across tasks such as classification, multiple-choice, and text generation. Our findings show that Zoop effectively fine-tunes large language models with fewer updates than current methods, demonstrating its efficiency in model optimization.
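The abstract describes estimating the gradient by perturbing a layer's output (activation) rather than its weights, then backpropagating the estimate locally through that layer. Below is a minimal PyTorch sketch of an output-perturbation zeroth-order estimator of this kind; it is an illustration under stated assumptions, not the authors' released implementation. The module names (`layer`, `head`), the helper `zoop_step`, and the hyperparameters `zo_eps` and `lr` are hypothetical choices for the example.

```python
# Illustrative sketch of zeroth-order gradient estimation via OUTPUT perturbation
# for a single layer. Not the authors' code; the actual Zoop method, its
# perturbation schedule, and its large-model integration may differ.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-stage model: the layer whose output we perturb, then the rest of the network.
layer = nn.Linear(16, 32)      # layer receiving the zeroth-order update
head = nn.Linear(32, 2)        # downstream network (kept fixed in this sketch)
loss_fn = nn.CrossEntropyLoss()

def zoop_step(x, y, zo_eps=1e-3, lr=1e-2):
    """One zeroth-order update of `layer` using output perturbation.

    1. Perturb the layer's output h with Gaussian noise z (two-sided finite difference)
       and evaluate the downstream loss without any backward pass.
    2. Form the estimated gradient w.r.t. h:  g_hat = (L+ - L-) / (2 * eps) * z.
    3. Backpropagate g_hat through `layer` only, yielding a gradient estimate for that
       layer's weights; the estimate is averaged over the batch, so its variance
       shrinks as the batch size grows.
    """
    with torch.no_grad():
        h = layer(x)                        # activation to perturb
        z = torch.randn_like(h)             # per-example perturbation direction
        loss_pos = loss_fn(head(h + zo_eps * z), y)
        loss_neg = loss_fn(head(h - zo_eps * z), y)
        grad_scale = (loss_pos - loss_neg) / (2 * zo_eps)
        grad_h = grad_scale * z             # zeroth-order estimate of dL/dh

    # Local backprop: only the perturbed layer needs stored activations and gradients;
    # no backward pass is taken through the downstream network.
    h = layer(x)
    h.backward(grad_h)
    with torch.no_grad():
        for p in layer.parameters():        # plain SGD update for the sketch
            p -= lr * p.grad
            p.grad = None

x = torch.randn(8, 16)                      # batch of 8 examples
y = torch.randint(0, 2, (8,))
zoop_step(x, y)
```

In this sketch, averaging the finite-difference estimate over the batch is what ties variance reduction to batch size, matching the abstract's claim; how Zoop selects which activations to perturb and balances memory against gradient accuracy is described in the paper itself.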
Submission Number: 39