TL;DR: We show theoretical convergence properties of the Expectation Maximization algorithm for a general class of Mixtures of Experts by viewing the algorithm as projected Mirror Descent with unit step size and a Kullback-Leibler divergence regularizer.
Abstract: Classical Mixtures of Experts (MoE) are machine learning models that partition the input space and train a separate "expert" model on each partition. Recently, MoE-based model architectures have become popular as a means to reduce training and inference costs. There, the partitioning function and the experts are both learnt jointly via gradient descent-type methods on the log-likelihood. In this paper, we study theoretical guarantees of the Expectation Maximization (EM) algorithm for training MoE models. We first rigorously analyze EM for MoE in which the conditional distribution of the target and latent variables given the feature variable belongs to an exponential family of distributions, and show its equivalence to projected Mirror Descent with unit step size and a Kullback-Leibler divergence regularizer. This perspective allows us to derive new convergence results and identify conditions for local linear convergence. In the special case of a mixture of 2 linear or logistic experts, we additionally provide guarantees for linear convergence based on the signal-to-noise ratio. Experiments on synthetic and (small-scale) real-world data support that EM outperforms gradient descent both in convergence rate and in achieved accuracy.
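For intuition only, below is a minimal sketch of EM for the simplest setting mentioned in the abstract: a mixture of two linear experts with a logistic gate. This is not the paper's general exponential-family algorithm; the function names, the fixed noise variance s2, and the gradient-based update of the gate parameters (used here because that M-step has no closed form) are all illustrative assumptions.

# Hypothetical sketch: EM for a mixture of two linear experts,
#   p(y | x) = g(x) N(y; a1^T x, s2) + (1 - g(x)) N(y; a2^T x, s2),
#   g(x) = sigmoid(w^T x).
# E-step: compute responsibilities. M-step: weighted least squares for the
# experts; a few gradient-ascent steps for the gate (assumed choice).
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def em_two_linear_experts(X, y, n_iters=50, gate_steps=10, gate_lr=0.1, s2=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    a1, a2 = rng.normal(size=d), rng.normal(size=d)  # expert weights
    w = np.zeros(d)                                  # gate weights
    for _ in range(n_iters):
        # E-step: posterior responsibility of expert 1 for each sample
        # (Gaussian normalizing constants cancel since s2 is shared).
        g = sigmoid(X @ w)
        l1 = g * np.exp(-(y - X @ a1) ** 2 / (2 * s2))
        l2 = (1 - g) * np.exp(-(y - X @ a2) ** 2 / (2 * s2))
        r = l1 / (l1 + l2 + 1e-12)
        # M-step (experts): responsibility-weighted least squares.
        a1 = np.linalg.solve(X.T @ (r[:, None] * X) + 1e-6 * np.eye(d), X.T @ (r * y))
        a2 = np.linalg.solve(X.T @ ((1 - r)[:, None] * X) + 1e-6 * np.eye(d), X.T @ ((1 - r) * y))
        # M-step (gate): gradient ascent on the expected complete-data
        # log-likelihood in w, i.e. sum_i r_i log g(x_i) + (1 - r_i) log(1 - g(x_i)).
        for _ in range(gate_steps):
            w += gate_lr * X.T @ (r - sigmoid(X @ w)) / n
    return a1, a2, w

if __name__ == "__main__":
    # Toy data generated from the assumed model.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 3))
    z = rng.random(500) < sigmoid(X @ np.array([2.0, 0.0, 0.0]))
    y = np.where(z, X @ np.array([1.0, -1.0, 0.5]), X @ np.array([-1.0, 1.0, 0.5]))
    y = y + 0.1 * rng.normal(size=500)
    print(em_two_linear_experts(X, y))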
Lay Summary: Machine learning models called Mixtures of Experts (MoE) work by dividing up the input space and assigning a specialized model, or "expert", to each part. These models have recently become popular for their ability to reduce the cost of training and of making predictions, especially in large-scale applications such as Large Language Models (LLMs). Typically, both the way the input is divided and the experts themselves are learned using a technique called gradient descent.
In our work, we revisit a classic but often overlooked alternative: the Expectation-Maximization (EM) algorithm. We show that, in the context of MoE training, EM has a connection to a modern optimization technique called Mirror Descent, and we use this link to better understand how and when EM works well for training MoE models. In particular, we identify conditions where EM can converge quickly and reliably.
We also provide mathematical guarantees for this behavior in simpler models, and our experiments confirm that EM often performs better than gradient descent, not only learning faster but also achieving higher accuracy. This highlights EM as a strong and theoretically grounded option for training expert-based models.
Primary Area: Theory->Optimization
Keywords: Expectation Maximization (EM), Mixtures of Experts (MoE), Mirror Descent
Submission Number: 12701