Guided by the Experts: Provable Feature Learning Dynamic of Soft-Routed Mixture-of-Experts

Published: 03 Feb 2026 · Last Modified: 03 Feb 2026 · AISTATS 2026 Poster · CC BY 4.0
Abstract: Mixture-of-Experts (MoE) architectures have emerged as a cornerstone of modern AI systems. In particular, MoEs route inputs dynamically to specialized experts, whose outputs are aggregated through weighted summation. Despite their widespread application, theoretical understanding of MoE training dynamics remains limited to either separate expert-router optimization or restrictive top-1 routing scenarios with carefully constructed datasets. This paper advances MoE theory by providing convergence guarantees for the joint training of soft-routed MoE models with non-linear routers and experts in a student-teacher framework. We prove that, with moderate over-parameterization, the student network undergoes a feature learning phase, in which the router's learning process is "guided" by the experts, that recovers the teacher's parameters. Moreover, we show that post-training pruning can effectively eliminate redundant neurons, followed by a provably convergent fine-tuning process that reaches global optimality. Our analysis brings novel insight into the optimization landscape of the MoE architecture.
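To make the architecture in the abstract concrete, here is a minimal sketch of a soft-routed MoE forward pass: a softmax router assigns a weight to every expert, and the output is the weighted sum of all (non-linear) expert outputs. All names, dimensions, and the choice of tanh experts are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax gate over expert logits
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def soft_routed_moe(x, router_W, expert_Ws, activation=np.tanh):
    """Soft routing: f(x) = sum_m pi_m(x) * h_m(x), where pi is the
    router's softmax output and h_m are non-linear expert outputs."""
    gate = softmax(router_W @ x)                                   # (M,) routing weights
    expert_out = np.array([activation(w @ x) for w in expert_Ws])  # (M,) expert outputs
    return gate @ expert_out                                       # weighted summation

# Toy usage with random weights (hypothetical sizes: d=4 inputs, M=3 experts)
rng = np.random.default_rng(0)
d, M = 4, 3
x = rng.normal(size=d)
router_W = rng.normal(size=(M, d))
expert_Ws = rng.normal(size=(M, d))
print(soft_routed_moe(x, router_W, expert_Ws))
```

Because every expert receives a non-zero weight, the objective is differentiable in both the router and the experts, which is what permits the joint-training analysis described above, unlike hard top-1 routing.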
Submission Number: 1687