Bridging the Gap Between Zeroth-Order and First-Order Fine-Tuning via Dynamic Adaptive Subspace Pre-tuning
Keywords: Zeroth-Order Optimizers, Performance-Memory Trade-off, Online Fine-tuning
Abstract: Fine-tuning large language models (LLMs) faces a trade-off between the accuracy of first-order (FO) methods and the memory efficiency of zeroth-order (ZO) optimizers. While ZO methods avoid the activation memory bottleneck of backpropagation, they typically converge slowly and show a noticeable performance gap compared to FO approaches. To address this, we propose \textbf{Dynamic Adaptive Subspace Pre-tuning (DASP)}, a framework that combines the efficiency of ZO methods with the accuracy of FO methods. DASP introduces a lightweight pre-computation stage that constructs low-rank, layer-wise subspaces aligned with the loss landscape. Fine-tuning is then restricted to small transformation matrices within these fixed subspaces, greatly reducing optimizer state memory. To further eliminate activation memory overhead, DASP employs a streaming backpropagation algorithm that decouples peak memory from sequence length. Experiments on LLaMA3 and OPT-13B show that DASP consistently outperforms ZO baselines by large margins (e.g., +6.5\% on RTE with LLaMA3), while matching the accuracy of FO methods at even lower memory cost. These results highlight DASP as a practical and scalable solution for memory-efficient LLM adaptation.
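The abstract's description of restricting fine-tuning to small transformation matrices inside fixed low-rank subspaces can be pictured with the minimal sketch below. It assumes a parameterization W_eff = W_0 + U M V^T, where the frozen projections U and V stand in for DASP's pre-computed layer-wise subspaces (generated randomly here rather than from the loss landscape) and only the small core M is trained. The class name `FixedSubspaceLinear` and the SGD loop are illustrative assumptions, not the paper's implementation, and the streaming backpropagation component is not modeled.

```python
import torch
import torch.nn as nn

class FixedSubspaceLinear(nn.Module):
    """Hypothetical sketch: a linear layer whose update is confined to a
    fixed low-rank subspace, W_eff = W0 + U @ M @ V^T. W0, U, V are frozen
    (in DASP, U and V would come from the pre-computation stage); only the
    small r x r matrix M is trained, so optimizer state is tiny."""

    def __init__(self, base: nn.Linear, U: torch.Tensor, V: torch.Tensor):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)           # freeze the pretrained weight
        self.register_buffer("U", U)          # (out_features, r), frozen
        self.register_buffer("V", V)          # (in_features, r), frozen
        r = U.shape[1]
        self.M = nn.Parameter(torch.zeros(r, r))  # trainable low-rank core

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the low-rank correction: x V M^T U^T
        return self.base(x) + (x @ self.V) @ self.M.t() @ self.U.t()


if __name__ == "__main__":
    torch.manual_seed(0)
    base = nn.Linear(512, 512)
    # Placeholder subspaces: random orthonormal bases stand in for the
    # loss-landscape-aligned subspaces DASP would pre-compute.
    U, _ = torch.linalg.qr(torch.randn(512, 8))
    V, _ = torch.linalg.qr(torch.randn(512, 8))
    layer = FixedSubspaceLinear(base, U, V)

    opt = torch.optim.SGD([layer.M], lr=1e-2)   # state only for the r x r core
    x, y = torch.randn(4, 512), torch.randn(4, 512)
    loss = nn.functional.mse_loss(layer(x), y)
    loss.backward()
    opt.step()
    print("trainable params:", layer.M.numel())  # 64 vs ~262k for the full layer
```

As a usage note, the sketch only illustrates why optimizer-state memory shrinks (state is kept for an r x r core instead of the full weight); the activation-memory savings the abstract attributes to streaming backpropagation would require chunking the sequence dimension during the backward pass, which is outside this example.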
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 10411