Keywords: mixture of experts, deep learning, theory, expressivity, granularity
TL;DR: We prove that increasing the granularity of Mixture-of-Experts layers enlarges the class of functions they can represent.
Abstract: Mixture-of-Experts (MoE) layers are increasingly central to frontier model architectures. By selectively activating parameters, they reduce computational cost while scaling total parameter count. This paper investigates the impact of the number of active experts, termed *granularity*, comparing architectures with many active experts (e.g., 8 per layer in DeepSeek) to those with few (e.g., 1 per layer in Llama-4 models). We prove an exponential separation in network expressivity based on this design parameter, suggesting that models benefit from higher granularity. Experimental results corroborate our theoretical findings and illustrate this separation.
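To make the design parameter concrete, the following is a minimal PyTorch sketch of a top-k-routed MoE layer in which k, the number of active experts per token, plays the role of granularity; the class name `TopKMoELayer` and all hyperparameters are hypothetical illustrations, not code from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoELayer(nn.Module):
    """Minimal top-k-routed MoE layer (illustrative sketch, not the authors' code).

    `k` is the number of active experts per token, i.e. the granularity
    parameter discussed in the abstract.
    """

    def __init__(self, d_model: int, d_hidden: int, num_experts: int, k: int):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.ReLU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        logits = self.router(x)                        # (tokens, num_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        gates = F.softmax(topk_vals, dim=-1)           # renormalize over the k active experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                    # expert chosen in this slot, per token
            gate = gates[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += gate[mask] * expert(x[mask])
        return out


# Same total expert count, different granularity (hypothetical sizes):
x = torch.randn(16, 64)
coarse = TopKMoELayer(d_model=64, d_hidden=128, num_experts=8, k=1)  # 1 active expert per token
fine = TopKMoELayer(d_model=64, d_hidden=128, num_experts=8, k=8)    # 8 active experts per token
print(coarse(x).shape, fine(x).shape)
```

Both layers activate a comparable parameter budget per token if the per-expert hidden width is scaled accordingly; the paper's separation result concerns how the represented function class changes as k grows.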
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 21077