Orchestrating Heterogeneous Architecture for Fast Inference of Mixture-of-Experts Models

ACL ARR 2024 June Submission 730 Authors

12 Jun 2024 (modified: 03 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures have shown promising performance on various tasks. However, their huge model sizes make them challenging to run in resource-constrained environments where GPU memory is scarce. Some existing systems use CPU resources to address this, but they either suffer significant overhead from frequently moving data between CPU and GPU, or fail to account for the different characteristics of CPU and GPU. This paper proposes Twiddler, a resource-efficient inference system for MoE models with limited GPU resources. Twiddler strategically utilizes the heterogeneous CPU-GPU computing architecture by determining the optimal execution strategy. Our evaluation shows that, unlike state-of-the-art systems that optimize for specific scenarios such as single-batch inference or long prefill, Twiddler performs better in all scenarios: it achieves a 1.26× speedup in single-batch inference, 1.30× in long prefill processing, and 11.57× in beam search inference, compared against different baselines.
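To make the abstract's core idea concrete, the sketch below illustrates one plausible form of the per-expert execution decision on a heterogeneous CPU-GPU system: for an expert whose weights reside in CPU memory, compare the cost of copying the weights to the GPU and computing there against computing on the CPU in place. This is a minimal sketch, not the paper's actual system; the function name `choose_execution` and all cost constants (PCIe bandwidth, per-device throughput, expert size) are illustrative assumptions.

```python
# Hypothetical sketch of a per-expert CPU-vs-GPU placement decision for a
# CPU-resident MoE expert. All names and constants are assumptions for
# illustration, not values from the paper.

def choose_execution(num_tokens: int,
                     expert_bytes: float,
                     pcie_gbps: float = 25.0,       # assumed PCIe bandwidth (GB/s)
                     gpu_tokens_per_s: float = 5e4,  # assumed GPU expert throughput
                     cpu_tokens_per_s: float = 2e3   # assumed CPU expert throughput
                     ) -> str:
    """Pick where to run one expert for a batch of routed tokens.

    Option A ("gpu"): copy the expert's weights over PCIe, then compute on GPU.
    Option B ("cpu"): compute on the CPU where the weights already live
    (activation transfer is small and treated as negligible here).
    """
    gpu_cost = expert_bytes / (pcie_gbps * 1e9) + num_tokens / gpu_tokens_per_s
    cpu_cost = num_tokens / cpu_tokens_per_s
    return "gpu" if gpu_cost < cpu_cost else "cpu"


if __name__ == "__main__":
    # Decoding: few tokens per expert, so moving gigabytes of weights
    # dominates and the CPU wins.
    print(choose_execution(num_tokens=2, expert_bytes=3.5e9))     # -> "cpu"
    # Long prefill: the weight transfer amortizes over many tokens,
    # so the GPU wins.
    print(choose_execution(num_tokens=4096, expert_bytes=3.5e9))  # -> "gpu"
```

Under these assumed costs, the decision flips with batch size, which matches the abstract's point that systems tuned for only one scenario (single-batch decoding or long prefill) leave performance on the table in the other.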
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: NLP in resource-constrained settings, inference methods, Special Theme Track
Contribution Types: Approaches to low compute settings-efficiency
Languages Studied: English
Submission Number: 730