CORM: Coarse-to-fine-grained Offloading for SMoE LLM Inference on Consumer-grade GPU

ACL ARR 2024 December Submission 1597 Authors

16 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract: Large language models (LLMs) with sparse mixture-of-experts (SMoE) have shown empirical success in various tasks. The sparse routing strategy of SMoE increases model capacity without a proportional increase in computation cost by activating only a subset of the parameters (i.e., the experts). Unfortunately, compared to previous LLMs without SMoE, this capacity benefit comes at the cost of substantially higher memory requirements, so deploying SMoE LLMs in resource-limited scenarios remains challenging. Previous approaches offload the expert weights of MoE models to the CPU, which significantly increases inference latency because expert weights must be copied between the CPU and GPU. To address this issue, we propose CORM, a coarse-to-fine-grained offloading framework for SMoE large language model inference. The framework exploits the sparsity present at both the expert and neuron levels of large models and offloads weights at both levels accordingly. We design an efficient, memory-saving coarse-to-fine-grained sparsity prediction mechanism that allows inference and weight prefetching to proceed in parallel, together with a coarse-to-fine-grained caching strategy that minimizes repeated weight loading. Experiments show that our method significantly accelerates SMoE LLM inference, reducing latency by up to 2.14× while accuracy decreases by only 1%. These features enable our coarse-to-fine-grained offloading framework to efficiently deploy large-scale SMoE LLMs on a consumer-grade GPU.
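
To illustrate the expert-level (coarse-grained) part of the idea, the following is a minimal sketch, not the authors' implementation, of an offloading cache with asynchronous prefetching in PyTorch. It assumes experts are ordinary nn.Module blocks kept in pinned CPU memory; names such as ExpertCache, pin_module, and the predicted expert IDs are illustrative assumptions, and the neuron-level (fine-grained) sparsity and the prediction mechanism described in the paper are not modeled here.

```python
import torch
import torch.nn as nn
from collections import OrderedDict


def pin_module(module):
    # Place a CPU module's parameters in pinned memory so that
    # host-to-device copies can run asynchronously.
    for p in module.parameters():
        p.data = p.data.pin_memory()
    return module


class ExpertCache:
    """Keeps at most `capacity` experts resident on the GPU (coarse, expert-level
    offloading); all other experts stay in CPU memory. Hypothetical sketch."""

    def __init__(self, cpu_experts, capacity, device="cuda"):
        self.cpu_experts = [pin_module(e) for e in cpu_experts]
        self.capacity = capacity
        self.device = device
        self.gpu_experts = OrderedDict()        # expert_id -> module, in LRU order
        self.copy_stream = torch.cuda.Stream()  # side stream for prefetch copies

    def prefetch(self, predicted_ids):
        # Copy the experts predicted for an upcoming layer on a side stream,
        # overlapping the transfer with computation on the default stream.
        with torch.cuda.stream(self.copy_stream):
            for eid in predicted_ids:
                self._ensure_on_gpu(eid)

    def get(self, eid):
        # Return a GPU-resident expert, waiting for any in-flight prefetch copy.
        torch.cuda.current_stream().wait_stream(self.copy_stream)
        expert = self._ensure_on_gpu(eid)
        self.gpu_experts.move_to_end(eid)       # mark as most recently used
        return expert

    def _ensure_on_gpu(self, eid):
        if eid in self.gpu_experts:
            return self.gpu_experts[eid]
        if len(self.gpu_experts) >= self.capacity:
            _, evicted = self.gpu_experts.popitem(last=False)
            evicted.to("cpu")                   # offload the LRU expert back to CPU
        module = self.cpu_experts[eid].to(self.device, non_blocking=True)
        self.gpu_experts[eid] = module
        return module


if __name__ == "__main__" and torch.cuda.is_available():
    experts = [nn.Linear(1024, 1024) for _ in range(8)]   # toy experts on the CPU
    cache = ExpertCache(experts, capacity=2)
    cache.prefetch([0, 3])                  # hypothetical router prediction
    x = torch.randn(1, 1024, device="cuda")
    y = cache.get(0)(x) + cache.get(3)(x)   # run only the activated experts
```

In this sketch, prefetching on a separate CUDA stream is what lets weight transfers overlap with ongoing computation, and the LRU policy stands in for the paper's caching strategy that avoids reloading recently used experts.
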
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: NLP in resource-constrained settings, quantization
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low compute settings-efficiency
Languages Studied: English
Submission Number: 1597