Keywords: GPU memory optimization, inference efficiency, training optimization, large-scale models, GPU-CPU data transfer
Abstract: The rapid growth in the size and complexity of machine learning models, particularly in natural language processing and computer vision, poses significant challenges for executing these models on hardware with limited resources. This paper introduces Superpipeline, a novel framework for optimizing the execution of large-scale AI models on constrained hardware during both training and inference. Our approach dynamically manages model execution by partitioning models into individual layers and efficiently transferring these layers between GPU and CPU memory.
Superpipeline substantially reduces GPU memory consumption, by up to 60\% in our experiments, while maintaining model accuracy and acceptable processing speed. This enables the execution of models that would otherwise exceed available GPU memory capacity. Unlike existing solutions that primarily target inference or specific model types, Superpipeline applies broadly across large language models (LLMs), vision-language models (VLMs), and vision-based models.
We evaluate Superpipeline's effectiveness through comprehensive experiments on diverse models and hardware configurations. Our method is characterized by two key parameters that allow fine-tuning of the trade-off between GPU memory usage and processing speed. Importantly, Superpipeline does not require model retraining or parameter modification, ensuring full preservation of the original model's output fidelity.
The simplicity and flexibility of Superpipeline make it a valuable tool for researchers and practitioners working with state-of-the-art AI models under hardware constraints. It enables the use of larger models or increased batch sizes on existing hardware, potentially accelerating innovation across various machine learning applications. This work represents a significant step towards democratizing access to advanced AI models and optimizing their deployment in resource-constrained environments.
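To make the mechanism described above concrete, the following is a minimal PyTorch sketch of layer-wise GPU-CPU swapping with prefetching. The abstract does not name the method's two tuning parameters, so the names here (`gpu_window` for the number of layers kept resident on the GPU and `prefetch_depth` for how far ahead layers are copied), along with the class `LayerSwappedStack`, are hypothetical stand-ins rather than the paper's actual API.

```python
import torch
import torch.nn as nn

class LayerSwappedStack(nn.Module):
    """Sketch: run a layer stack while keeping only a small window on the GPU."""

    def __init__(self, layers, gpu_window=2, prefetch_depth=1):
        super().__init__()
        self.layers = nn.ModuleList(layers).cpu()  # all weights parked in host memory
        self.gpu_window = gpu_window               # hypothetical knob: layers kept on GPU after use
        self.prefetch_depth = prefetch_depth       # hypothetical knob: layers copied ahead of compute
        self.copy_stream = torch.cuda.Stream()     # side stream for async host-to-device copies
        self.ready = [torch.cuda.Event() for _ in self.layers]
        self.resident = set()                      # indices of layers currently on the GPU

    def _prefetch(self, i):
        """Enqueue an asynchronous host-to-device copy of layer i on the side stream."""
        if 0 <= i < len(self.layers) and i not in self.resident:
            with torch.cuda.stream(self.copy_stream):
                # Pinned host memory would make this copy truly asynchronous.
                self.layers[i].to("cuda", non_blocking=True)
            self.ready[i].record(self.copy_stream)
            self.resident.add(i)

    @torch.no_grad()
    def forward(self, x):
        x = x.to("cuda")
        self._prefetch(0)
        for i, layer in enumerate(self.layers):
            # Start copying upcoming layers while layer i is waited on and computed.
            for j in range(i + 1, i + 1 + self.prefetch_depth):
                self._prefetch(j)
            self.ready[i].wait()  # compute stream waits only for layer i's weights
            x = layer(x)
            # Evict the layer that falls out of the resident window back to the CPU.
            victim = i - self.gpu_window
            if victim >= 0:
                # .cpu() is stream-ordered after the kernels that used these weights.
                self.layers[victim].cpu()
                self.resident.discard(victim)
        return x
```

Under this reading, a larger `gpu_window` trades GPU memory for speed by avoiding repeated transfers of recently used layers, while a larger `prefetch_depth` hides transfer latency behind computation; this matches the memory-speed trade-off the abstract attributes to the method's two parameters.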
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3868