Keywords: Mixture of Experts, Large Language Models, Efficient Foundation Models
Abstract: Mixture-of-Experts (MoE) enables efficient training of large models through sparse activation, routing each input to a small subset of experts selected according to its characteristics. In MoE, an unbalanced expert load can lead to routing collapse or increased computational overhead. Existing methods commonly adopt an expert-centered balancing strategy to address this, prioritizing equal utilization of experts over semantic alignment between tokens and experts.
However, this can lead to a pseudo-balance phenomenon: to maintain expert load balance, the same input is randomly routed to different experts across training steps rather than to the best-matching expert. This introduces two critical issues: (1) severe knowledge overlap among experts, resulting in redundant representations and inefficient parameter utilization; (2) difficulty in forming and stabilizing expert specialization. These issues limit the scalability of models, especially large language models (LLMs).
To address these limitations, we introduce Memory-Aware Routing (MAR), an approach that enhances existing load-balancing strategies. By equipping each expert with a memory buffer, our method explicitly models each expert's long-term token preferences, allowing historical experience to guide routing. This ensures that tokens are routed more consistently to compatible experts, mitigating the pseudo-balance problem while maintaining global load balance and fostering expert specialization.
Experimental results show that Memory-Aware Routing improves expert specialization by 35\% and downstream accuracy by 2\%-25\%, doubles parameter efficiency, and matches baseline performance with only half the experts (one-quarter of the parameters).
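The abstract does not give MAR's exact formulation, so the following is only a minimal, hypothetical sketch of one way a per-expert memory buffer could bias a top-k router: an exponential-moving-average memory of the token embeddings each expert has received, whose cosine similarity to incoming tokens is added to the gating logits. All names here (MemoryAwareRouter, memory_decay, memory_weight) are illustrative assumptions, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryAwareRouter(nn.Module):
    """Hypothetical sketch: a top-k MoE router whose logits are biased by
    a per-expert EMA memory of previously routed token embeddings."""

    def __init__(self, d_model: int, num_experts: int, top_k: int = 2,
                 memory_decay: float = 0.99, memory_weight: float = 0.5):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k
        self.memory_decay = memory_decay
        self.memory_weight = memory_weight
        # One memory vector per expert, updated outside the autograd graph.
        self.register_buffer("expert_memory", torch.zeros(num_experts, d_model))

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, d_model)
        gate_logits = self.gate(x)                                    # (T, E)
        # Similarity between each token and each expert's remembered preference.
        sim = F.cosine_similarity(
            x.unsqueeze(1), self.expert_memory.unsqueeze(0), dim=-1)  # (T, E)
        logits = gate_logits + self.memory_weight * sim
        probs = logits.softmax(dim=-1)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)         # (T, k)

        # Update each selected expert's memory with the tokens it received.
        with torch.no_grad():
            for e in range(self.expert_memory.size(0)):
                mask = (topk_idx == e).any(dim=-1)
                if mask.any():
                    new_mem = x[mask].mean(dim=0)
                    self.expert_memory[e].mul_(self.memory_decay).add_(
                        (1.0 - self.memory_decay) * new_mem)

        # Standard Switch-Transformer-style auxiliary load-balancing loss.
        density = torch.zeros_like(probs).scatter_(
            -1, topk_idx, 1.0).mean(dim=0)      # fraction of tokens per expert
        mean_probs = probs.mean(dim=0)
        aux_loss = (density * mean_probs).sum() * probs.size(-1)
        return topk_idx, topk_probs, aux_loss
```

In this sketch the memory acts as a slow-moving prior on token-expert affinity and is updated without gradients, so the usual auxiliary load-balancing objective is kept unchanged; how MAR actually combines the memory signal with the gating scores is defined in the paper itself.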
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16573