Keywords: mixture of experts, deep learning, theory, expressivity, granularity
TL;DR: We prove that increasing the granularity of Mixture-of-Experts layers enlarges the class of functions they can represent.
Abstract: Mixture-of-Experts (MoE) layers are increasingly central to frontier model architectures. By selectively activating parameters, they reduce computational cost while scaling total parameter count. This paper investigates the impact of the number of active experts, termed *granularity*, comparing architectures with many active experts (e.g., 8 per layer in DeepSeek) to those with few (e.g., 1 per layer in Llama-4 models). We prove an exponential separation in network expressivity based on this design parameter, suggesting that models benefit from higher granularity. Experimental results corroborate our theoretical findings and illustrate this separation.
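To make the design parameter concrete, the following is a minimal PyTorch sketch of a top-k-routed MoE layer in which k, the number of active experts per token, plays the role of granularity; the class name `TopKMoELayer` and all hyperparameters are hypothetical illustrations, not code from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoELayer(nn.Module):
    """Minimal top-k-routed MoE layer (illustrative sketch, not the authors' code).

    `k` is the number of active experts per token, i.e. the granularity
    parameter discussed in the abstract.
    """

    def __init__(self, d_model: int, d_hidden: int, num_experts: int, k: int):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.ReLU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        logits = self.router(x)                        # (tokens, num_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        gates = F.softmax(topk_vals, dim=-1)           # renormalize over the k active experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                    # expert chosen in this slot, per token
            gate = gates[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += gate[mask] * expert(x[mask])
        return out


# Same total expert count, different granularity (hypothetical sizes):
x = torch.randn(16, 64)
coarse = TopKMoELayer(d_model=64, d_hidden=128, num_experts=8, k=1)  # 1 active expert per token
fine = TopKMoELayer(d_model=64, d_hidden=128, num_experts=8, k=8)    # 8 active experts per token
print(coarse(x).shape, fine(x).shape)
```

Both layers activate a comparable parameter budget per token if the per-expert hidden width is scaled accordingly; the paper's separation result concerns how the represented function class changes as k grows.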
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 21077