Keywords: Mixture-of-Experts, Auxiliary Losses, Intra-Layer Specialization, Cross-Layer Coupling, Expert Overlap
TL;DR: We propose synergistic intra- and cross-layer regularization losses that reinforce one another, driving stronger MoE expert specialization and more decisive routing without changing the router or model architecture.
Abstract: Sparse Mixture-of-Experts (MoE) models scale Transformers efficiently but suffer from expert overlap: redundant representations across experts and ambiguous routing, which leaves model capacity severely underutilized. We propose two plug-and-play regularization losses that improve specialization and routing consistency without changing router or model architectures. First, an intra-layer specialization loss penalizes cosine similarity between experts' SwiGLU activations on the same tokens to encourage complementary representations. Second, a cross-layer coupling loss maximizes joint Top-$k$ routing probabilities across adjacent layers to promote coherent expert pathways through depth. The two losses are mutually reinforcing: improved specialization reduces overlap and stabilizes pathways, while coupling reduces routing volatility and amplifies specialization. Both losses are orthogonal to the standard load-balancing loss and compatible with both the shared-expert architecture in DeepSeekMoE and vanilla Top-$k$ MoE architectures. We implement both losses as a drop-in Megatron-LM module. Extensive experiments across pre-training and downstream benchmarks demonstrate consistent task gains, higher expert specialization, and lower-entropy routing; together, these improvements translate into faster inference via more stable expert pathways.
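Since the abstract describes the two losses only in prose, the following is a minimal PyTorch-style sketch of plausible forms under stated assumptions, not the paper's implementation. The tensor layouts, the function names `intra_layer_specialization_loss` and `cross_layer_coupling_loss`, and the exact way pairwise similarities and joint Top-$k$ mass are aggregated are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def intra_layer_specialization_loss(expert_acts: torch.Tensor) -> torch.Tensor:
    """Penalize pairwise cosine similarity between experts' activations on the
    same tokens (hypothetical form of the intra-layer specialization loss).

    expert_acts: [num_tokens, num_experts, hidden] -- each token's SwiGLU
    activation under every expert being compared (assumed layout).
    """
    acts = F.normalize(expert_acts, dim=-1)               # unit-norm per expert
    sim = torch.einsum("teh,tfh->tef", acts, acts)        # pairwise cosine similarities
    # Zero out the diagonal (self-similarity) before averaging.
    diag = torch.diag_embed(torch.diagonal(sim, dim1=-2, dim2=-1))
    off_diag = sim - diag
    t, e = acts.size(0), acts.size(1)
    return off_diag.abs().sum() / (t * e * (e - 1))


def cross_layer_coupling_loss(probs_l: torch.Tensor,
                              probs_next: torch.Tensor,
                              k: int) -> torch.Tensor:
    """Encourage coherent expert pathways by rewarding joint Top-k routing
    probability mass in adjacent layers (hypothetical form; returned negated
    so it can be minimized alongside the other losses).

    probs_l, probs_next: [num_tokens, num_experts] softmax router outputs
    of two adjacent MoE layers.
    """
    topk_l, _ = probs_l.topk(k, dim=-1)
    topk_next, _ = probs_next.topk(k, dim=-1)
    joint = topk_l.sum(dim=-1) * topk_next.sum(dim=-1)     # joint Top-k mass per token
    return -joint.mean()
```

In an actual training loop, both terms would presumably be scaled by small coefficients and added to the language-modeling and load-balancing losses, consistent with the abstract's claim that they are orthogonal to standard load balancing.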
Submission Number: 71