Keywords: LLMs, zeroth-order optimization, efficient CPU-offloading, memory efficient fine-tuning
Abstract: Fine-tuning pre-trained LLMs typically requires a vast amount of GPU memory. Standard first-order optimizers such as SGD face a significant challenge as LLMs grow, because back-propagation must cache activations during the forward pass and gradients during the backward pass, incurring a large memory overhead. In contrast, zeroth-order (ZO) methods can estimate gradients with only two forward passes and no activation caching. Additionally, offloading to CPU memory and compute can extend the effective memory and computational capacity of a single GPU.
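To make the two-forward-pass ZO gradient estimate concrete, here is a minimal SPSA-style sketch (in the spirit of MeZO); the function and variable names (`zo_step`, `loss_fn`, `eps`, `lr`, `seed`) are illustrative assumptions, not the paper's actual implementation:

```python
import torch

def zo_step(model, loss_fn, batch, eps=1e-3, lr=1e-6, seed=0):
    """One ZO update: perturb parameters, run two forward passes, update in place."""
    params = [p for p in model.parameters() if p.requires_grad]

    def perturb(scale):
        # Re-seeding regenerates the same random direction z without storing it.
        torch.manual_seed(seed)
        for p in params:
            z = torch.randn_like(p)
            p.data.add_(scale * eps * z)

    with torch.no_grad():
        perturb(+1)                          # theta + eps * z
        loss_plus = loss_fn(model, batch)
        perturb(-2)                          # theta - eps * z
        loss_minus = loss_fn(model, batch)
        perturb(+1)                          # restore theta

        # Projected gradient estimate along z: (L+ - L-) / (2 * eps)
        grad_scalar = (loss_plus - loss_minus) / (2 * eps)

        torch.manual_seed(seed)              # regenerate the same z for the update
        for p in params:
            z = torch.randn_like(p)
            p.data.add_(-lr * grad_scalar * z)
```

Because only the scalar losses and the random seed are needed, no activations or per-parameter gradients are cached, which is what makes ZO attractive for memory-constrained fine-tuning.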
To enable efficient fine-tuning of LLMs on a single GPU, we introduce ZO-Offloading, a framework that strategically utilizes both CPU and GPU resources for ZO fine-tuning. ZO-Offloading dynamically offloads model parameters to the CPU and retrieves them to the GPU as needed, reducing idle time and maximizing GPU utilization so that computation remains continuous and efficient. Parameter updates are fused with ZO's dual forward passes to eliminate redundant data transfers, further improving the efficiency of fine-tuning. The framework also incorporates a novel low-bit precision technique for CPU-GPU data transfers under automatic mixed precision (AMP), as well as asynchronous checkpointing for LLM fine-tuning.
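The following is a minimal sketch of how per-layer CPU-GPU parameter offloading can be overlapped with computation using a side CUDA stream, under the assumption of a decoder-style model processed layer by layer; the names (`upload_stream`, `prefetch`, `offload`, `forward_with_offloading`) are hypothetical and do not correspond to the paper's actual API:

```python
import torch

upload_stream = torch.cuda.Stream()  # dedicated stream for CPU -> GPU uploads

def prefetch(layer):
    # Copy the layer's CPU parameters to the GPU on the side stream.
    # Asynchrony requires the CPU tensors to live in pinned memory.
    with torch.cuda.stream(upload_stream):
        for p in layer.parameters():
            p.data = p.data.to("cuda", non_blocking=True)

def offload(layer):
    # Return parameters to the CPU to free GPU memory for later layers.
    # A real implementation would reuse a persistent pinned CPU buffer.
    for p in layer.parameters():
        p.data = p.data.cpu()

def forward_with_offloading(layers, hidden):
    prefetch(layers[0])
    for i, layer in enumerate(layers):
        # Make sure layer i has finished uploading before computing with it.
        torch.cuda.current_stream().wait_stream(upload_stream)
        if i + 1 < len(layers):
            prefetch(layers[i + 1])   # overlap the next upload with this compute
        hidden = layer(hidden)
        offload(layer)
    return hidden
```

The key design point is that the upload of layer i+1 proceeds on `upload_stream` while layer i computes on the default stream, so transfer latency is hidden behind computation rather than added to it.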
With ZO-Offloading, for the first time, it becomes possible to fine-tune extremely large models, such as OPT-175B with over $\textbf{175 billion}$ parameters, on a single GPU with just $\textbf{24GB}$ of memory—a feat unattainable with conventional methods. Moreover, our framework incurs no additional wall-clock time cost compared to standard ZO methods.
Supplementary Material: pdf
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8369