Expert-Token Resonance MoE: Bidirectional Routing with Efficiency Affinity-Driven Active Selection

ACL ARR 2025 February Submission1434 Authors

13 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: Mixture-of-Experts (MoE) architectures have emerged as a paradigm-shifting approach for large language models (LLMs), offering unprecedented computational efficiency. However, these architectures suffer from token distribution imbalance and expert homogenization, which impede optimal semantic generalization. We propose a novel expert routing framework that incorporates: (1) an efficient routing mechanism with lightweight computation; (2) an adaptive bidirectional selection mechanism that leverages resonance between experts and tokens; and (3) a module that determines the lower bounds of expert capacity based on dynamic token distribution analysis, specifically designed to address drop-and-pad strategies. The framework is also integrated with an orthogonal feature extraction module and an optimized loss function for expert localization. It effectively reduces expert homogeneity while enhancing the performance of the expert selection module. Additionally, we introduce a local expert strategy that simultaneously improves load balancing and reduces network communication overhead. The framework achieves a 40% reduction in the number of tokens processed by each expert without compromising model convergence or efficacy. When coupled with communication optimizations, training efficiency improves by 5.4% to 46.6%. After supervised fine-tuning, the model exhibits performance gains of 9.7% to 14.1% across the GDAD, GPQA, and TeleQnA benchmarks.
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Machine Learning for NLP
Contribution Types: NLP engineering experiment
Languages Studied: English, Chinese, French, German, Spanish, Italian
Submission Number: 1434