QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC-ND 4.0
Abstract: The deployment of mixture-of-experts (MoE) large language models (LLMs) presents significant challenges due to their high memory demands. These challenges become even more pronounced in multi-tenant environments, where shared resources must accommodate multiple models, limiting the effectiveness of conventional virtualization techniques. This paper addresses the problem of efficiently serving multiple fine-tuned MoE-LLMs on a single GPU. We propose a serving system that employs similarity-based expert consolidation to reduce the overall memory footprint by sharing similar experts across models. To ensure output quality, we introduce runtime partial reconfiguration, dynamically replacing non-expert layers when processing requests from different models. As a result, our approach achieves competitive output quality while maintaining throughput comparable to serving a single model, and incurs only a negligible increase in time-to-first-token (TTFT). Experiments on a server with a single NVIDIA A100 GPU (80GB) using Mixtral-8x7B models demonstrate an 85% average reduction in turnaround time compared to NVIDIA's multi-instance GPU (MIG). Furthermore, experiments on Google's Switch Transformer Base-8 model with up to four variants demonstrate the scalability and resilience of our approach in maintaining output quality compared to other model merging baselines, highlighting its effectiveness.
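To make the idea of similarity-based expert consolidation concrete, the sketch below shows one way such a step could look in PyTorch: corresponding experts from two fine-tuned variants are compared by cosine similarity of their flattened weights, and experts above a threshold are shared rather than duplicated. This is a minimal illustration under assumed interfaces (lists of per-expert state dicts, an illustrative `consolidate_experts` helper and threshold), not the paper's actual implementation.

```python
import torch


def flatten_expert(expert_state):
    """Concatenate an expert's weight tensors into one flat vector."""
    return torch.cat([p.flatten() for p in expert_state.values()])


def consolidate_experts(experts_a, experts_b, threshold=0.95):
    """Share experts between two fine-tuned MoE variants when their weights
    are nearly identical (cosine similarity >= `threshold`).

    `experts_a` / `experts_b`: lists of state dicts, one per expert, for the
    same MoE layer of models A and B. Returns a consolidated expert list for
    model B that reuses model A's tensors wherever similarity allows.
    """
    consolidated, shared = [], 0
    for e_a, e_b in zip(experts_a, experts_b):
        v_a, v_b = flatten_expert(e_a), flatten_expert(e_b)
        sim = torch.nn.functional.cosine_similarity(v_a, v_b, dim=0)
        if sim >= threshold:
            consolidated.append(e_a)  # point B's expert at A's tensors (shared memory)
            shared += 1
        else:
            consolidated.append(e_b)  # keep B's own expert
    print(f"shared {shared}/{len(experts_a)} experts at this layer")
    return consolidated
```

In this reading, only the experts that diverge meaningfully after fine-tuning keep a private copy, which is what drives the memory-footprint reduction described in the abstract.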
Lay Summary: Large language models like ChatGPT are powerful, but they require a lot of memory to run—especially when using a technique called “mixture of experts” (MoE), where different parts of the model are used depending on the task. This becomes a major challenge when multiple such models, each fine-tuned for a specific purpose, need to run on the same shared hardware, such as a single GPU. Our research introduces a more efficient way to serve several of these models together by reducing the overall memory requirement. We achieve this by identifying and sharing similar expert components between models to save memory, and dynamically swapping other parts in and out as needed to maintain response accuracy. This makes it possible to serve multiple MoE language models simultaneously, with performance close to that of running a single model and only a slight reduction in accuracy. Our system performs especially well in constrained environments, significantly reducing completion times compared to existing resource-sharing solutions. This can help make advanced AI tools more scalable and accessible—even on limited hardware.
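The "dynamically swapping other parts in and out" step maps to the paper's runtime partial reconfiguration: one MoE instance stays resident on the GPU, and only the per-variant non-expert layers are copied in when a request targets a different fine-tuned model. The sketch below is a hypothetical PyTorch illustration of that pattern; class and attribute names (`PartialReconfigurator`, `non_expert_keys`) are assumptions for illustration, not the released system's API.

```python
import torch


class PartialReconfigurator:
    """Keep one MoE model on the GPU and swap only its non-expert layers
    (attention, norms, embeddings) when switching between fine-tuned variants.
    Consolidated expert weights stay resident and shared across variants."""

    def __init__(self, gpu_model, non_expert_keys):
        self.gpu_model = gpu_model
        self.non_expert_keys = set(non_expert_keys)  # e.g. keys without ".experts."
        self.cpu_cache = {}                          # model_id -> {key: pinned CPU tensor}
        self.active_id = None

    def register_variant(self, model_id, state_dict):
        # Cache only the per-variant, non-expert tensors in pinned host memory
        # so they can be streamed to the GPU with fast async copies.
        self.cpu_cache[model_id] = {
            k: v.cpu().pin_memory()
            for k, v in state_dict.items()
            if k in self.non_expert_keys
        }

    def activate(self, model_id):
        # No work needed if the requested variant is already configured.
        if model_id == self.active_id:
            return
        gpu_state = self.gpu_model.state_dict()
        for k, v in self.cpu_cache[model_id].items():
            gpu_state[k].copy_(v, non_blocking=True)  # overwrite in place on the GPU
        torch.cuda.synchronize()
        self.active_id = model_id
```

Because only the comparatively small non-expert layers move over PCIe, a switch of this kind would add little latency per request, consistent with the negligible TTFT increase reported in the abstract.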
Link To Code: https://github.com/hamid-74/Multi-MoE
Primary Area: Optimization->Large Scale, Parallel and Distributed
Keywords: Mixture-of-Experts, Large Language Models, Virtualization, Multi-Tenant Environments
Submission Number: 12372