Sparse Mixture of Experts (SMoE) has emerged as a breakthrough approach for achieving unprecedented scalability in deep learning. By enabling models to expand their parameter count dramatically while selectively activating only a small subset of parameters per sample, SMoEs maintain high efficiency. However, SMoE models are susceptible to routing fluctuations, leading to instability and non-robustness. In this work, we unveil SMoE-based attention as a point estimate of the regression function of a three-layer hierarchical mixture-of-experts regression. Through this probabilistic graphical model (PGM) framework, we highlight the conditional independence assumption in the expert-selection process across tokens, which exposes the model to routing fluctuations and non-robustness. Motivated by this PGM framework, we propose Mutual-Inform SMoEs, including Similarity-Inform and Attention-Inform SMoE, which eliminate this conditional independence assumption by allowing tokens to directly influence one another's expert decisions. We theoretically demonstrate that our methods lower the entropy in decision-making, enabling more confident and consistent expert assignments. Finally, we empirically validate our models on ImageNet classification and Wikitext-103 language modeling, showing significant improvements in reducing routing fluctuations, enhancing performance, and increasing model robustness compared to baseline Transformer-SMoE models.
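To make the contrast concrete, the sketch below compares a standard per-token SMoE router, which scores each token independently, with a hypothetical similarity-informed variant in which routing logits are smoothed across tokens via a normalized similarity matrix. This is only an illustrative approximation of the idea of tokens informing each other's expert decisions, not the paper's exact Mutual-Inform formulation; the function names, the mixing weight `alpha`, and the softmax similarity are assumptions introduced here for clarity.

```python
# Illustrative sketch only (not the paper's exact method): independent top-1 SMoE
# routing vs. a hypothetical "similarity-informed" routing where tokens influence
# one another's expert decisions through a normalized token-similarity matrix.
import torch
import torch.nn.functional as F

def independent_routing(tokens, router_weight):
    # tokens: (n_tokens, d_model); router_weight: (d_model, n_experts)
    logits = tokens @ router_weight        # each token is scored independently
    return logits.argmax(dim=-1)           # per-token expert choice

def similarity_informed_routing(tokens, router_weight, alpha=0.5):
    # Hypothetical variant: mix each token's routing logits with those of similar
    # tokens, so expert decisions are no longer conditionally independent.
    logits = tokens @ router_weight                                   # (n_tokens, n_experts)
    sim = F.softmax(tokens @ tokens.t() / tokens.shape[-1] ** 0.5, dim=-1)
    mixed = alpha * logits + (1 - alpha) * sim @ logits               # neighbor-smoothed logits
    return mixed.argmax(dim=-1)

if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(8, 16)   # 8 tokens, d_model = 16
    w = torch.randn(16, 4)   # 4 experts
    print(independent_routing(x, w))
    print(similarity_informed_routing(x, w))
```

The intended effect of the smoothing step is that nearby tokens tend toward the same expert, which is one plausible way routing entropy could be reduced relative to fully independent per-token decisions.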