Balance Beam: adaptive computation for affordable training and inference with high-throughput offloading for LLMs

23 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Deep Learning, Heterogeneous Computing, Offloading, Large Batch Training, Large Language Model
Abstract: With the surging growth of model parameters, foundation models pose unprecedented challenges to traditional computational infrastructure. These large models intrinsically require substantial accelerator memory to accommodate massive tensors, including model weights, activations, and optimizer states, during pre-training, fine-tuning, or even inference. To alleviate this pressure on memory, rather than adding accelerators to meet the high memory demand, offloading these parameters from the accelerator to other storage media such as DRAM is a preferable option for fine-tuning or inference under computationally restricted circumstances. However, the prohibitive cost of data movement renders it a theoretically plausible yet practically unattractive solution. Previous state-of-the-art methods improve inference throughput by retaining part of the model state $\textit{in-situ}$ across multiple mini-batches, but they introduce intricate hyperparameters and excessive cache-exchange overhead. In this work, we propose a comprehensive workflow to address these challenges, focusing on dynamic analysis of model-system compatibility and on prioritizing computational intensity over data movement. We show that the proposed workflow enables both fine-tuning and inference of foundation models at higher throughput under restricted computational resources. Compared to the state-of-the-art approach, our framework attains a speedup of over 4x for training and 2x for inference with a 30-billion-parameter model on a single NVIDIA A100 GPU.
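The abstract describes streaming model weights between accelerator memory and DRAM so that compute, not data movement, dominates. Below is a minimal, hypothetical PyTorch sketch of that general layer-wise offloading pattern; the class and parameter names are illustrative only and do not reproduce the paper's framework, its dynamic model-system analysis, or its batching policy.

```python
# Minimal sketch of layer-wise weight offloading for inference (illustrative,
# not the paper's actual implementation). Transformer blocks live in host
# DRAM and are streamed to the GPU only for the duration of their forward pass.
import torch
import torch.nn as nn


class OffloadedStack(nn.Module):
    def __init__(self, num_layers: int = 4, d_model: int = 1024):
        super().__init__()
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        # Blocks are constructed on the CPU and stay there between uses.
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(num_layers)
        )
        if self.device == "cuda":
            # Pinned host memory speeds up the initial host-to-device copies.
            # A production system would keep dedicated pinned staging buffers
            # and overlap transfers with compute; this sketch does neither.
            for p in self.parameters():
                p.data = p.data.pin_memory()

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.to(self.device, non_blocking=True)
        for block in self.blocks:
            block.to(self.device, non_blocking=True)  # stream weights in
            x = block(x)                              # compute on accelerator
            block.to("cpu")                           # evict weights to DRAM
        return x


if __name__ == "__main__":
    model = OffloadedStack()
    tokens = torch.randn(8, 128, 1024)  # (batch, seq_len, d_model)
    print(model(tokens).shape)
```

The key trade-off the paper targets is visible even in this toy version: each block incurs a host-to-device transfer per forward pass, so throughput improves only when the batch is large enough that computation on the resident block outweighs the cost of moving it.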
Primary Area: infrastructure, software libraries, hardware, etc.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7297