Orchestrating Heterogeneous Architectures for Fast Inference of Mixture-of-Experts Models

Published: 22 Jan 2025, Last Modified: 11 Feb 2025 · ICLR 2025 Poster · CC BY 4.0
Keywords: NLP in resource-constrained settings, inference methods
Abstract: Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures have shown promising performance on various tasks. However, their huge model sizes make them challenging to run in resource-constrained environments where GPU memory is scarce. Some existing systems use CPU resources to address this, but they either suffer significant overhead from frequently moving data between the CPU and GPU, or fail to account for the different performance characteristics of the two processors. This paper proposes Twiddler, a resource-efficient inference system for MoE models under limited GPU resources. Twiddler strategically utilizes the heterogeneous CPU and GPU computing architecture by determining the optimal execution strategy. Our evaluation shows that, unlike state-of-the-art systems that optimize for specific scenarios such as single-batch inference or long prefill, Twiddler performs better in all scenarios. Twiddler achieves 1.26 times speedup in single-batch inference, 1.30 times in long prefill processing, and 11.57 times in beam search inference, compared against different baselines.
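The central decision a system of this kind faces is, per expert and per batch, whether to run an expert's feed-forward computation on the CPU (where the offloaded weights already reside) or pay the PCIe cost of copying the weights to the GPU first. Below is a minimal, hypothetical sketch of such a placement rule; the throughput constants, Mixtral-style expert dimensions, and function names are illustrative assumptions, not the paper's actual cost model.

```python
# Hypothetical sketch (not the authors' code): decide, per MoE expert, whether
# to compute on CPU or copy its weights to GPU, using a simple cost model.
# All constants below are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Costs:
    cpu_flops_per_s: float = 2e11     # assumed sustained CPU GEMM throughput
    gpu_flops_per_s: float = 1e13     # assumed sustained GPU GEMM throughput
    pcie_bytes_per_s: float = 2.4e10  # assumed host-to-device bandwidth

def expert_placement(num_tokens: int, hidden: int, ffn: int,
                     weight_bytes: int, c: Costs = Costs()) -> str:
    """Return 'cpu' or 'gpu' for one expert on this batch of tokens.

    An expert FFN with gate/up/down projections performs roughly three
    (tokens x hidden) @ (hidden x ffn)-shaped matmuls, i.e. about
    6 * tokens * hidden * ffn FLOPs.
    """
    flops = 6.0 * num_tokens * hidden * ffn
    cpu_time = flops / c.cpu_flops_per_s
    # Running on GPU first requires moving the expert's weights over PCIe.
    gpu_time = weight_bytes / c.pcie_bytes_per_s + flops / c.gpu_flops_per_s
    return "cpu" if cpu_time < gpu_time else "gpu"

# Mixtral-like expert: 3 fp16 matrices of 4096 x 14336 (~352 MB).
w = 3 * 4096 * 14336 * 2
print(expert_placement(num_tokens=1,    hidden=4096, ffn=14336, weight_bytes=w))  # cpu
print(expert_placement(num_tokens=2048, hidden=4096, ffn=14336, weight_bytes=w))  # gpu
```

With a single decode token the weight transfer dominates and the CPU wins; with a long prefill the GPU amortizes the copy. This mirrors the abstract's point that no single static policy is best across single-batch decoding, long prefill, and beam search.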
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2885