Keywords: mixtures of experts, theory, deep learning, feature noise, robustness
TL;DR: We prove that MoEs can exploit activation sparsity to achieve high robustness to feature noise.
Abstract: Mixtures of Experts (MoEs) allow deep neural networks to grow in size without incurring large inference costs. Theories explaining their success largely focus on enhanced expressiveness and generalization compared to single-expert models. We identify a novel mechanism: MoEs can outperform single experts by exploiting activation sparsity amidst feature noise, even without an increase in parameter count. This enables MoEs to achieve superior generalization performance, robustness, training convergence speed, and sample complexity. Our results further offer a theoretical basis for techniques like MoEfication, which transform dense layers into MoE architectures by exploiting activation sparsity. Experiments on synthetic data and standard real-world language tasks support our theoretical insights.
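To make the activation-sparsity mechanism concrete, here is a minimal sketch (not the paper's construction) of a top-1 routed MoE layer whose total parameter count matches a single dense layer split across experts; the class name `ToyMoE`, the dimensions, and the additive feature-noise model are illustrative assumptions, not details from the submission.

```python
# Minimal sketch: a hard top-1 routed MoE layer. Each input activates only one
# small expert, illustrating activation sparsity at matched parameter count.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyMoE(nn.Module):
    def __init__(self, d_in: int, d_hidden: int, n_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_in, n_experts)  # router producing expert scores
        self.experts = nn.ModuleList(
            [nn.Linear(d_in, d_hidden // n_experts) for _ in range(n_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.gate(x)                    # (batch, n_experts)
        top1 = scores.argmax(dim=-1)             # hard top-1 routing
        out = torch.zeros(x.size(0), self.experts[0].out_features)
        for e, expert in enumerate(self.experts):
            mask = top1 == e
            if mask.any():
                # activation sparsity: each input passes through one expert only
                out[mask] = F.relu(expert(x[mask]))
        return out


# Illustrative usage with additive feature noise (an assumed noise model).
x_clean = torch.randn(8, 16)
x_noisy = x_clean + 0.5 * torch.randn(8, 16)
moe = ToyMoE(d_in=16, d_hidden=64, n_experts=4)
print(moe(x_noisy).shape)  # torch.Size([8, 16])
```

In this sketch the router decides which small expert each (possibly noisy) input reaches, so only a sparse subset of units is ever active per input; the paper's theory analyzes why this kind of sparse routing can confer robustness to feature noise.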
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 4848