Distributed Population-Based Simultaneous Perturbation Stochastic Approximation for Fine-Tuning Large Language Models
Abstract: Recently, a memory-efficient variant (MeZO) of simultaneous perturbation stochastic approximation (SPSA), a well-established zeroth-order optimizer from the automatic control community, has shown competitive fine-tuning performance on large language models (LLMs) with billions of weights under the limited memory of a single GPU, a setting that often rules out the standard backpropagation (BP) update. In parallel, population-based evolutionary methods (such as evolution strategies from OpenAI and population-based training from DeepMind) offer a promising route to improving and scaling up large-scale AI systems by efficiently exploiting increasingly available distributed computing resources, owing to their relatively simple yet efficient data and control flows. In this paper, we propose a population-based SPSA for fine-tuning LLMs with non-differentiable objective functions. Our method combines the exploration ability of distributed population-based evolutionary methods with the strong exploitation ability of SPSA in modern distributed computing environments, where an efficient implementation of BP remains challenging because of the complex dependencies in its forward-pass and backward-pass information flows. Simulation experiments on a set of challenging (non-differentiable) natural language processing tasks empirically validate the fine-tuning potential of our method, even in a relatively small-scale distributed computing scenario.
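To make the combination described above concrete, the sketch below illustrates the two ingredients on a toy objective: an SPSA/MeZO-style zeroth-order step that estimates a directional derivative from two forward evaluations, and a simple population loop in which each member runs SPSA independently and the worst member is periodically replaced by the best one. This is a minimal illustration under assumptions, not the authors' implementation: the function names (spsa_step, population_spsa), the hyperparameters, and the selection rule are all illustrative choices, and the quadratic objective stands in for a non-differentiable LLM evaluation metric.

```python
import numpy as np


def spsa_step(theta, loss_fn, eps=1e-3, lr=1e-2, rng=None):
    """One SPSA / MeZO-style zeroth-order update: two forward evaluations,
    no backpropagation, and a step along the sampled perturbation direction."""
    rng = rng or np.random.default_rng()
    z = rng.standard_normal(theta.shape)           # random perturbation direction
    loss_plus = loss_fn(theta + eps * z)           # forward pass at theta + eps * z
    loss_minus = loss_fn(theta - eps * z)          # forward pass at theta - eps * z
    g_hat = (loss_plus - loss_minus) / (2 * eps)   # scalar projected-gradient estimate
    return theta - lr * g_hat * z                  # SGD-style step along z


def population_spsa(loss_fn, dim=10, pop_size=4, rounds=20, steps_per_round=50, seed=0):
    """Toy population loop (hypothetical): each member runs SPSA independently
    (exploitation); after each round the worst member is overwritten by a copy
    of the best one (selection-based exploration, in the spirit of
    population-based training)."""
    rng = np.random.default_rng(seed)
    population = [rng.standard_normal(dim) for _ in range(pop_size)]
    for _ in range(rounds):
        for i in range(pop_size):                  # in a distributed setting, one worker per member
            for _ in range(steps_per_round):
                population[i] = spsa_step(population[i], loss_fn, rng=rng)
        losses = [loss_fn(theta) for theta in population]
        best, worst = int(np.argmin(losses)), int(np.argmax(losses))
        population[worst] = population[best].copy()
    return min(population, key=loss_fn)


if __name__ == "__main__":
    # Stand-in objective; a real use case would instead evaluate an LLM with a
    # non-differentiable score (e.g., exact match or F1) on a validation set.
    quadratic = lambda x: float(np.sum((x - 1.0) ** 2))
    print(population_spsa(quadratic)[:3])
```

Because each population member only needs forward evaluations and a scalar loss, the per-worker communication reduces to exchanging parameters (or random seeds) and fitness values, which is what makes this data and control flow simple to distribute compared with BP.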