DIRICHLET-PRIOR SHAPING: GUIDING EXPERT SPECIALIZATION IN UPCYCLED MOES

Published: 03 Mar 2026, Last Modified: 03 Mar 2026 · CC BY 4.0
Keywords: Mixture of Experts, Upcycling, Router, Distribution shaping, MLLM, VLM, MoE training, Specialization, Prior
Abstract: Upcycling pre-trained dense models into sparse Mixture-of-Experts (MoE) models efficiently increases model capacity during post-training, but often suffers from poor expert specialization due to naive weight replication. We introduce Dirichlet-Prior Shaping Loss (DPSL), a novel router regularization technique that directly shapes routing probability distributions by matching expert assignments to a target Dirichlet prior, thereby enhancing expert specialization. DPSL enables the encoding of inductive biases, such as encouraging experts to focus on specific modalities or tasks, without requiring manual intervention. DPSL is a general tool applicable to any module that outputs categorical probability distributions, extending its utility beyond MoE training. Experiments on upcycled MoE vision-language models show that DPSL consistently outperforms existing upcycling strategies and regularization techniques across standard vision-language benchmarks, addressing the critical issue of poor specialization and yielding higher-performing models.
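The exact loss formulation is not given on this page; the sketch below is a minimal, hedged illustration (in PyTorch) of one plausible reading, in which the routing probabilities of each token are scored under a target Dirichlet prior and the negative log-density (up to the alpha-dependent constant) is used as a regularizer. The function name `dirichlet_prior_shaping_loss` and the concentration vector `alpha` are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def dirichlet_prior_shaping_loss(router_logits: torch.Tensor,
                                 alpha: torch.Tensor,
                                 eps: float = 1e-8) -> torch.Tensor:
    """Hypothetical sketch of a Dirichlet-prior shaping term.

    router_logits: [num_tokens, num_experts] raw router outputs.
    alpha:         [num_experts] Dirichlet concentration parameters
                   (assumed; e.g. values < 1 favor peaked, specialized routing).
    """
    # Per-token routing distribution over experts.
    probs = F.softmax(router_logits, dim=-1)
    # Dirichlet log-density up to its normalizing constant:
    #   sum_e (alpha_e - 1) * log p_e
    log_density = ((alpha - 1.0) * torch.log(probs + eps)).sum(dim=-1)
    # Penalize deviation from the prior; averaged over tokens.
    return -log_density.mean()

# Example usage (illustrative values): add the shaping term to the task loss.
logits = torch.randn(16, 8)                     # 16 tokens, 8 experts
alpha = torch.full((8,), 0.5)                   # assumed sparse-favoring prior
reg = 0.01 * dirichlet_prior_shaping_loss(logits, alpha)
```

Under this reading, choosing a concentration vector below 1 would push routing distributions toward low-entropy, specialized assignments, while non-uniform alpha could encode a preference for particular experts (e.g. per modality); the weight on the regularizer (0.01 above) is likewise an assumed hyperparameter.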
Submission Number: 50