Keywords: MoE, LLM, GPU Load Balance, Heterogeneous Experts
Abstract: Mixture-of-Experts (MoE) offers superior performance over dense models. However, current MoE architectures impose a critical limitation by enforcing uniform expert sizes, which restricts the model's ability to dynamically match computational resources to token-specific requirements. Although several attempts at heterogeneous experts have been made, they struggle either with limited performance and inefficient parameter utilization or with unbalanced GPU utilization; a general heterogeneous MoE architecture is still lacking.
To this end, we present Mixture of Heterogeneous Grouped Experts (MoHGE), an innovative MoE architecture that introduces a two-level routing mechanism, enabling more nuanced and efficient expert selection tailored to each input token's characteristics. We also propose a Group-Wise Auxiliary Loss to improve parameter-utilization efficiency without compromising model performance.
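The abstract does not include implementation details, so the following is only a minimal PyTorch sketch of how a two-level router over heterogeneous expert groups could be structured: a level-1 gate picks an expert group per token, and a level-2 gate picks top-k experts inside that group, where different groups use different expert hidden sizes. The module and argument names (`HeterogeneousGroupedMoE`, `group_hidden_sizes`, `experts_per_group`, `top_k`) are illustrative assumptions, not the authors' code.

```python
# Illustrative two-level routing over heterogeneous expert groups.
# All names, shapes, and design choices here are assumptions for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HeterogeneousGroupedMoE(nn.Module):
    def __init__(self, d_model, group_hidden_sizes, experts_per_group, top_k=2):
        super().__init__()
        self.num_groups = len(group_hidden_sizes)
        self.top_k = top_k
        # Level 1: route each token to one expert group.
        self.group_gate = nn.Linear(d_model, self.num_groups, bias=False)
        # Level 2: one gate per group, routing to the experts inside that group.
        self.expert_gates = nn.ModuleList(
            nn.Linear(d_model, experts_per_group, bias=False)
            for _ in group_hidden_sizes
        )
        # Experts in different groups have different hidden sizes (heterogeneous).
        self.groups = nn.ModuleList(
            nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, h), nn.GELU(), nn.Linear(h, d_model))
                for _ in range(experts_per_group)
            )
            for h in group_hidden_sizes
        )

    def forward(self, x):                       # x: (tokens, d_model)
        group_probs = F.softmax(self.group_gate(x), dim=-1)
        group_idx = group_probs.argmax(dim=-1)  # hard level-1 choice per token
        out = torch.zeros_like(x)
        for g, (gate, experts) in enumerate(zip(self.expert_gates, self.groups)):
            mask = group_idx == g
            if not mask.any():
                continue
            xg = x[mask]
            probs = F.softmax(gate(xg), dim=-1)
            topw, topi = probs.topk(self.top_k, dim=-1)   # level-2 top-k choice
            topw = topw / topw.sum(dim=-1, keepdim=True)
            yg = torch.zeros_like(xg)
            for k in range(self.top_k):
                for e in range(len(experts)):
                    sel = topi[:, k] == e
                    if sel.any():
                        yg[sel] += topw[sel, k:k + 1] * experts[e](xg[sel])
            # Scaling by the group score keeps the level-1 gate differentiable.
            out[mask] = yg * group_probs[mask, g:g + 1]
        return out
```

In this reading, "heterogeneous" shows up only in the per-group hidden size `h`, while routing cost stays uniform because every token still activates the same number of experts; the paper's actual design may differ.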
To address the resulting workload imbalance, we develop (1) an All-size Group-decoupling Allocation strategy and (2) an Intra-Group Experts Auxiliary Loss, which together ensure balanced GPU utilization.
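The paper's exact loss formulation is not given in the abstract; as a hedged sketch, an intra-group balancing term could follow the standard Switch-Transformer-style auxiliary loss, applied separately within each group so that tokens dispatched to a group are spread evenly over that group's experts. The function name, `alpha`, and the tensor shapes below are assumptions.

```python
# Hypothetical intra-group balancing loss: a Switch-style auxiliary term computed
# per group over that group's experts. Illustrative only; not the paper's formula.
import torch
import torch.nn.functional as F


def intra_group_aux_loss(expert_logits, expert_idx, num_experts, alpha=0.01):
    """expert_logits: (tokens_in_group, num_experts) level-2 gate logits.
    expert_idx:       (tokens_in_group,) expert chosen for each token."""
    probs = F.softmax(expert_logits, dim=-1).mean(dim=0)           # mean gate prob
    frac = F.one_hot(expert_idx, num_experts).float().mean(dim=0)  # dispatch share
    # Minimized when dispatch fractions and gate probabilities are both uniform.
    return alpha * num_experts * torch.sum(frac * probs)
```

A per-group term like this would typically be summed over groups (and a group-wise analogue over the level-1 gate) and added to the language-modeling loss during training.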
Extensive evaluations on multiple benchmarks demonstrate that MoHGE achieves performance comparable to state-of-the-art MoE architectures while reducing the total parameter count by approximately 20\% and maintaining balanced GPU utilization. Our work establishes a new paradigm for resource-aware MoE design that better aligns computational allocation with actual inference demands.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 10611