Mixture of Experts Provably Detect and Learn the Latent Cluster Structure in Gradient-Based Learning
Abstract: Mixture of Experts (MoE), an ensemble of specialized models equipped with a router that dynamically assigns each input to appropriate experts, has achieved remarkable success in machine learning. However, theoretical understanding of this architecture lags behind due to its inherent complexity. In this paper, we theoretically study the sample and runtime complexity of MoE trained with stochastic gradient descent (SGD) when learning a regression task with an underlying cluster structure of single-index models. On the one hand, we show that a vanilla neural network fails to detect such a latent organization, as it can only process the problem as a whole. This failure is intrinsically related to the *information exponent*, which is low for each cluster but increases when the task is considered as a whole. On the other hand, we show that an MoE succeeds in dividing the problem into easier subproblems by leveraging the ability of each expert to weakly recover the simpler function corresponding to an individual cluster. To the best of our knowledge, this work is among the first to explore the benefits of the MoE framework by examining its SGD dynamics in the context of nonlinear regression.
Lay Summary: Mixture of Experts (MoE) models dynamically route inputs to specialized sub-models called experts, enabling remarkable efficiency in large-scale learning settings such as large language models (LLMs). Yet, the theoretical understanding of MoE remains limited. This paper investigates how MoE models, when trained via stochastic gradient descent (SGD), can provably detect and learn hidden cluster structures in nonlinear regression tasks. We demonstrate that a standard neural network cannot separate these clusters, as it treats the problem as a whole. In contrast, MoE models, under appropriate conditions, leverage the specialization of experts to divide the task into simpler subproblems, enabling more efficient learning. To the best of our knowledge, this is among the first works to analyze the dynamics of MoE under SGD in nonlinear regression.
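To make the setting concrete, below is a minimal, illustrative sketch of the kind of problem and architecture described above: data drawn from a mixture of single-index models with a latent cluster label, and a small MoE with a softmax router trained end to end with SGD. This is not the paper's exact model, parameterization, or scaling; all specifics (the link function `g`, dimension `d`, expert width, step size, etc.) are hypothetical choices for illustration only.

```python
# Hedged sketch: a mixture of single-index models y = g(<w_c, x>) with latent
# cluster index c, fit by a small soft-routed MoE trained with plain SGD.
# All hyperparameters and the link function g are illustrative assumptions.
import torch

torch.manual_seed(0)
d, n_clusters, n_steps, batch = 64, 4, 2000, 256
g = lambda z: z ** 2 - 1  # example low-information-exponent link (degree-2 Hermite)
W = torch.nn.functional.normalize(torch.randn(n_clusters, d), dim=1)  # hidden cluster directions

def sample(batch_size):
    """Draw inputs, pick a latent cluster per sample, emit y = g(<w_c, x>)."""
    x = torch.randn(batch_size, d)
    c = torch.randint(n_clusters, (batch_size,))  # cluster label is never shown to the model
    y = g((x * W[c]).sum(dim=1))
    return x, y

class MoE(torch.nn.Module):
    def __init__(self, dim, n_experts):
        super().__init__()
        self.router = torch.nn.Linear(dim, n_experts)  # gating network
        self.experts = torch.nn.ModuleList(
            torch.nn.Sequential(torch.nn.Linear(dim, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
            for _ in range(n_experts)
        )

    def forward(self, x):
        gate = torch.softmax(self.router(x), dim=-1)  # soft routing weights per input
        outs = torch.stack([e(x).squeeze(-1) for e in self.experts], dim=-1)
        return (gate * outs).sum(dim=-1)  # gated combination of expert predictions

model = MoE(d, n_clusters)
opt = torch.optim.SGD(model.parameters(), lr=0.05)
for step in range(n_steps):
    x, y = sample(batch)
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(f"step {step:4d}  mse {loss.item():.3f}")
```

In this toy setup one can inspect, after training, whether the router's weights align with the hidden directions and whether individual experts specialize to individual clusters, which is the qualitative behavior the paper analyzes rigorously.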
Primary Area: Theory->Learning Theory
Keywords: Mixture of Experts, feature learning, single-index models
Submission Number: 8925