TL;DR: An improved routing framework for expert diversification in MoE, based on the Mahalanobis distance and an expert co-occurrence matrix.
Abstract: We introduce Mahalanobis-Pruned Mixture-of-Experts (MP-MoE), a novel routing framework that approaches expert selection from the perspective of ensemble pruning. Existing Mixture-of-Experts (MoE) routing strategies often suffer from representation collapse due to greedy top-k selection mechanisms or rely on complex auxiliary regularization terms that may compromise model performance. To address these issues, we formulate routing as a diversity-aware subset selection problem and optimize a Mahalanobis-distance-based objective that explicitly enhances expert diversity. Specifically, we demonstrate that the expert co-occurrence matrix effectively captures inter-expert correlations, allowing us to efficiently model the covariance structure required for distance computation without accessing expert parameters. Furthermore, we devise a greedy strategy for the routing mechanism, backed by theoretical approximation guarantees, rendering it a plug-and-play module with negligible overhead.
MP-MoE increases wall-clock training time by approximately 3% while incurring no additional latency at inference time.
Extensive experiments demonstrate that, in large language model pre-training, our method consistently outperforms the baseline by 1-3 percentage points across a broad range of benchmarks.
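To make the routing idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes one plausible Mahalanobis-type objective, f(S) = s_S^T Sigma_SS^{-1} s_S, where s holds the router scores and Sigma is a correlation-like matrix derived from co-occurrence counts; the objective actually optimized by MP-MoE may differ. The function names (`cooccurrence_matrix`, `cooccurrence_to_corr`, `greedy_route`) and the `eps` and `ridge` parameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def cooccurrence_matrix(assignments, num_experts):
    """Count how often each pair of experts is selected for the same token."""
    onehot = np.zeros((len(assignments), num_experts))
    for t, experts in enumerate(assignments):
        onehot[t, list(experts)] = 1.0
    # (E, E) matrix; the diagonal holds per-expert selection counts.
    return onehot.T @ onehot

def cooccurrence_to_corr(C, eps=1e-6):
    """Normalise counts into a correlation-like proxy (an assumption of this sketch)."""
    d = np.sqrt(np.diag(C) + eps)
    R = C / np.outer(d, d)
    np.fill_diagonal(R, 1.0)
    return R

def greedy_route(scores, R, k, ridge=1e-3):
    """Greedily pick k experts to maximise s_S^T Sigma_SS^{-1} s_S (assumed objective)."""
    n = len(scores)
    sigma = R + ridge * np.eye(n)  # small ridge keeps every sub-matrix invertible
    selected = []
    for _ in range(k):
        best, best_val = None, -np.inf
        for j in range(n):
            if j in selected:
                continue
            idx = selected + [j]
            s = scores[idx]
            # Mahalanobis-type quadratic form of the candidate subset's scores.
            val = s @ np.linalg.solve(sigma[np.ix_(idx, idx)], s)
            if val > best_val:
                best, best_val = j, val
        selected.append(best)
    return selected

# Toy history in which experts 0 and 1 are almost always chosen together.
history = [(0, 1)] * 8 + [(0, 2)] + [(1, 3)] + [(2, 3)] * 6
R = cooccurrence_to_corr(cooccurrence_matrix(history, num_experts=4))
scores = np.array([1.0, 0.95, 0.8, 0.1])
top2 = [int(i) for i in np.argsort(-scores)[:2]]
diverse = greedy_route(scores, R, k=2)
```

In this toy setting, plain top-2 selection picks the two highest-scoring but strongly co-occurring experts, whereas the greedy rule swaps the second pick for a less redundant expert. Note that the sketch only reads routing statistics and never touches expert parameters, which is consistent with the abstract's description.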
Lay Summary: This paper introduces MP-MoE, a new training method that helps different experts learn more distinct and complementary skills. Instead of simply choosing the experts with the highest scores, our method considers whether the selected experts are likely to provide diverse information. It does this by tracking how often experts are chosen together during training and using this information to avoid repeatedly selecting overly similar experts.
Experiments show that MP-MoE improves the quality of language model pre-training across several standard evaluation tasks. It achieves consistent gains over the standard expert-selection method, while adding only a small training cost and no extra cost during inference. This makes the method practical for improving large language models without making them slower to use.
Link To Code: https://github.com/kxlkxl1999/MP-MoE
Primary Area: Deep Learning->Large Language Models
Keywords: Mixture-of-Experts, Mahalanobis distance, ensemble pruning, expert diversity, large language models
Originally Submitted PDF: pdf
Submission Number: 5899