Keywords: Mixture-of-Experts, Auxiliary Losses, Intra-Layer Specialization, Cross-Layer Coupling, Expert Overlap
TL;DR: We propose synergistic intra- and cross-layer regularization losses that reinforce one another, driving stronger MoE expert specialization and more decisive routing without changing the router or model architecture.
Abstract: Sparse Mixture-of-Experts (MoE) models scale Transformers efficiently but suffer from expert overlap: redundant representations across experts and ambiguous routing, which leaves model capacity severely underutilized. We propose two plug-and-play regularization losses that improve specialization and routing consistency without changing router or model architectures. First, an intra-layer specialization loss penalizes cosine similarity between experts' SwiGLU activations on the same tokens to encourage complementary representations. Second, a cross-layer coupling loss maximizes joint Top-$k$ routing probabilities across adjacent layers to promote coherent expert pathways through depth. The two losses are mutually reinforcing: improved specialization reduces overlap and stabilizes pathways, while coupling reduces routing volatility and amplifies specialization. Both losses are orthogonal to the standard load-balancing loss and compatible with both the shared-expert architecture in DeepSeekMoE and vanilla Top-$k$ MoE architectures. We implement both losses as a drop-in Megatron-LM module. Extensive experiments across pre-training and downstream benchmarks demonstrate consistent task gains, higher expert specialization, and lower-entropy routing; together, these improvements translate into faster inference via more stable expert pathways.
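Since the abstract describes the two losses only in prose, the following is a minimal PyTorch-style sketch of plausible forms under stated assumptions, not the paper's implementation. The tensor layouts, the function names `intra_layer_specialization_loss` and `cross_layer_coupling_loss`, and the exact way pairwise similarities and joint Top-$k$ mass are aggregated are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def intra_layer_specialization_loss(expert_acts: torch.Tensor) -> torch.Tensor:
    """Penalize pairwise cosine similarity between experts' activations on the
    same tokens (hypothetical form of the intra-layer specialization loss).

    expert_acts: [num_tokens, num_experts, hidden] -- each token's SwiGLU
    activation under every expert being compared (assumed layout).
    """
    acts = F.normalize(expert_acts, dim=-1)               # unit-norm per expert
    sim = torch.einsum("teh,tfh->tef", acts, acts)        # pairwise cosine similarities
    # Zero out the diagonal (self-similarity) before averaging.
    diag = torch.diag_embed(torch.diagonal(sim, dim1=-2, dim2=-1))
    off_diag = sim - diag
    t, e = acts.size(0), acts.size(1)
    return off_diag.abs().sum() / (t * e * (e - 1))


def cross_layer_coupling_loss(probs_l: torch.Tensor,
                              probs_next: torch.Tensor,
                              k: int) -> torch.Tensor:
    """Encourage coherent expert pathways by rewarding joint Top-k routing
    probability mass in adjacent layers (hypothetical form; returned negated
    so it can be minimized alongside the other losses).

    probs_l, probs_next: [num_tokens, num_experts] softmax router outputs
    of two adjacent MoE layers.
    """
    topk_l, _ = probs_l.topk(k, dim=-1)
    topk_next, _ = probs_next.topk(k, dim=-1)
    joint = topk_l.sum(dim=-1) * topk_next.sum(dim=-1)     # joint Top-k mass per token
    return -joint.mean()
```

In an actual training loop, both terms would presumably be scaled by small coefficients and added to the language-modeling and load-balancing losses, consistent with the abstract's claim that they are orthogonal to standard load balancing.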
Submission Number: 71