Smoothness Bridges Sparsity and Stability in MoEs

26 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Mixture-of-Experts (MoE), Model Sparsity, Training Stability
TL;DR: This paper establishes and verifies the trade-off between expert sparsity and training stability in Mixture-of-Experts, proposing a novel structure that enhances stability without sacrificing sparsity.
Abstract: Mixture-of-Experts (MoE) architectures have recently emerged as an effective approach for scaling model capacity while managing computational costs by leveraging expert sparsity, where only a subset of experts is activated during inference. Despite their computational efficiency, MoE models face challenges in training stability compared to their dense counterparts, largely due to the introduction of expert sparsity. While several methods have been proposed to mitigate this instability, the underlying relationship between expert sparsity and training stability remains unclear. In this work, we develop a theoretical framework that demonstrates an inverse correlation between training stability and expert sparsity, with gradient smoothness serving as the bridge. We derive an upper bound on training stability, formalizing for the first time the sparsity-stability trade-off in MoE models. Our findings show that activating more experts enhances gradient smoothness and improves training stability, but at the cost of reduced sparsity. We validate our theory through extensive experiments on various architectures and datasets, and propose a novel MoE structure that improves stability without sacrificing sparsity. This design introduces independent router heads and a soft top-$K$ selection via sampling without replacement, which smooths the gradient landscape while maintaining expert sparsity. Further analysis confirms the promise of this structure in striking an optimal balance between sparsity and stability, offering a new direction for optimizing MoE architectures in large-scale models.
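
The sketch below is a minimal, illustrative implementation (not the authors' code, which is not included in this submission page) of the routing mechanism described in the abstract: independent router heads combined with a soft top-$K$ selection via sampling without replacement, here realized with the Gumbel-top-k trick in PyTorch. All names and hyperparameters (MultiHeadRouter, num_heads, tau) are assumptions for illustration.

```python
# Minimal illustrative sketch (not the authors' code) of a router with
# independent heads and soft top-K expert selection via sampling without
# replacement. Class and parameter names are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadRouter(nn.Module):
    def __init__(self, d_model: int, num_experts: int,
                 num_heads: int = 4, top_k: int = 2, tau: float = 1.0):
        super().__init__()
        # Independent router heads: each produces its own expert logits.
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, num_experts) for _ in range(num_heads)]
        )
        self.top_k = top_k
        self.tau = tau  # temperature scaling the sampling noise

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model). Average the logits of the independent heads.
        logits = torch.stack([head(x) for head in self.heads]).mean(dim=0)

        # Soft top-K via sampling without replacement (Gumbel-top-k trick):
        # perturbing the logits with Gumbel noise and keeping the K largest
        # entries samples K distinct experts without replacement.
        u = torch.rand_like(logits).clamp_(1e-9, 1.0 - 1e-9)
        gumbel = -torch.log(-torch.log(u))
        _, topk_idx = torch.topk(logits + self.tau * gumbel, self.top_k, dim=-1)

        # Keep only the sampled experts and renormalize their softmax weights,
        # so each token still activates exactly top_k experts (sparsity) while
        # gradients flow through smooth, stochastic gates.
        mask = torch.zeros_like(logits).scatter_(-1, topk_idx, 1.0)
        gates = F.softmax(logits, dim=-1) * mask
        return gates / gates.sum(dim=-1, keepdim=True)


# Usage: route a batch of 4 token representations to 2 of 8 experts.
if __name__ == "__main__":
    router = MultiHeadRouter(d_model=16, num_experts=8)
    gates = router(torch.randn(4, 16))
    print(gates.shape, (gates > 0).sum(dim=-1))  # torch.Size([4, 8]), 2 experts per token
```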
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8271