Improving MoE Performance and Efficiency with Plug-and-Play Intra-Layer Specialization and Cross-Layer Coupling Losses

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Mixture-of-Experts, Auxiliary losses, Intra-Layer Specialization, Cross-layer Coupling, Plug-and-Play
TL;DR: We introduce two auxiliary losses for MoE—intra-layer specialization loss and cross-layer coupling loss—as plug-and-play modules within the Megatron-LM framework, enhancing both model performance and inference throughput.
Abstract: Sparse Mixture-of-Experts (MoE) models scale Transformers efficiently but suffer from expert overlap, where different experts process similar tokens and learn redundant functions, resulting in ambiguous routing and underutilized capacity. While architectural solutions like DeepSeek-style shared experts promote specialization, they require substantial structural modifications and rely solely on intra-layer signals. We propose two plug-and-play auxiliary losses that enhance MoE specialization and routing efficiency without modifying routers or model architectures. First, an intra-layer specialization loss penalizes cosine similarity between experts' SwiGLU activations on identical tokens, encouraging experts to specialize in complementary functions. Second, a cross-layer dependency loss maximizes joint Top-$k$ routing probabilities across adjacent layers, establishing coherent expert pathways through network depth while reinforcing intra-layer specialization. Both losses are orthogonal to the standard load-balancing loss and compatible with shared-expert and vanilla Top-$k$ MoE architectures. We implement both losses as a drop-in Megatron-LM module. Extensive experiments across pre-training, fine-tuning, and zero-shot benchmarks demonstrate consistent task gains, higher expert specialization, and lower-entropy routing; together, these improvements translate into faster inference via more stable expert pathways.
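The two auxiliary losses described above can be illustrated with a minimal stdlib-Python sketch. This is not the authors' Megatron-LM implementation: the function names, the mean-pairwise-cosine form of the specialization loss, and the reading of "joint Top-$k$ routing probability" as the summed product of routing probabilities for experts appearing in both adjacent layers' Top-$k$ sets are all illustrative assumptions inferred from the abstract.

```python
import math

def cosine(u, v):
    # Standard cosine similarity between two activation vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def intra_layer_specialization_loss(expert_acts):
    """Mean pairwise cosine similarity between experts' (e.g. SwiGLU)
    activations on the same token; minimizing it pushes experts toward
    complementary functions. Form assumed, not taken from the paper."""
    n = len(expert_acts)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(expert_acts[i], expert_acts[j]) for i, j in pairs) / len(pairs)

def cross_layer_coupling_loss(probs_l, probs_l_next, k=2):
    """Negative joint Top-k routing mass across adjacent layers:
    sum of p_l(e) * p_{l+1}(e) over experts in both layers' Top-k sets.
    Minimizing this (maximizing the joint mass) encourages coherent
    expert pathways through depth. Interpretation assumed."""
    def topk(p):
        return set(sorted(range(len(p)), key=p.__getitem__, reverse=True)[:k])
    shared = topk(probs_l) & topk(probs_l_next)
    return -sum(probs_l[e] * probs_l_next[e] for e in shared)
```

In training, both terms would simply be weighted and added to the language-modeling and load-balancing losses, which is what makes them plug-and-play: no router or architecture change is needed.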
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23256