Keywords: linear mode connectivity, mixture-of-experts
TL;DR: We investigate Linear Mode Connectivity (LMC) in Mixture-of-Experts (MoE) architectures by analyzing their underlying permutation symmetries and proposing expert-matching algorithms that align independently trained MoEs to reveal LMC.
Abstract: Linear Mode Connectivity (LMC) is a notable phenomenon in the loss landscapes
of neural networks, wherein independently trained models have been observed to
be connected—up to permutation symmetries—by linear paths in parameter space
along which the loss remains consistently low. This observation challenges classical
views of non-convex optimization and has implications for model ensembling,
generalization, and our understanding of neural loss geometry. Inspired by recent
studies on LMC in standard neural networks, we systematically investigate this
phenomenon within Mixture-of-Experts (MoE) architectures, a class of models
known for their scalability and computational efficiency that combine traditional
neural networks, referred to as experts, through a learnable gating mechanism.
We begin by conducting a comprehensive analysis of both dense and sparse gating
regimes, demonstrating that the symmetries inherent to MoE architectures are
fully characterized by permutations acting on both the expert components and the
gating function. Building on these foundational findings, we propose a matching
algorithm that enables alignment between independently trained MoEs, thereby
facilitating the discovery of LMC. Finally, we empirically validate the presence of
LMC using our proposed algorithm across diverse MoE configurations—including
dense, sparse, and shared-expert variants—under a wide range of model settings
and datasets of varying scales and modalities. Our results confirm the existence
of LMC in MoE architectures and offer fundamental insights into the functional
landscape and optimization dynamics of deep learning models.
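
The permutation symmetry and the align-then-interpolate idea described in the abstract can be illustrated with a small sketch. This is a toy under stated assumptions, not the paper's algorithm: the abstract does not specify the matching procedure, so a brute-force search over permutations is swapped in as a stand-in. The experts are assumed to be linear maps with scalar output, the gate is a dense softmax, and "model B" is simply a relabelled, slightly perturbed copy of model A rather than an independently trained network. All names (`moe_forward`, `permute`, `match`, `interpolate`) are hypothetical.

```python
import itertools
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def moe_forward(gate, experts, x):
    """Dense softmax-gated MoE with linear scalar experts (toy model, assumed)."""
    logits = [dot(row, x) for row in gate]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return sum(w / total * dot(e, x) for w, e in zip(weights, experts))


def permute(gate, experts, perm):
    """Reorder gate rows and experts with the same index sequence."""
    return [gate[i] for i in perm], [experts[i] for i in perm]


def match(gate_a, experts_a, gate_b, experts_b):
    """Brute-force stand-in for a matching step (not the paper's method):
    find perm such that B[perm[i]] is closest to A[i] in squared L2 distance."""
    def cost(perm):
        return sum(
            (a - b) ** 2
            for i, j in enumerate(perm)
            for a, b in zip(gate_a[i] + experts_a[i], gate_b[j] + experts_b[j])
        )
    return min(itertools.permutations(range(len(gate_a))), key=cost)


def interpolate(p, q, t):
    """Pointwise (1 - t) * p + t * q for two lists of rows."""
    return [[(1 - t) * a + t * b for a, b in zip(rp, rq)] for rp, rq in zip(p, q)]


rng = random.Random(0)
n_experts, dim = 3, 4
gate_a = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
experts_a = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
x = [rng.gauss(0, 1) for _ in range(dim)]

# Symmetry: relabelling experts together with their gate rows keeps the output.
gate_p, experts_p = permute(gate_a, experts_a, [2, 0, 1])
gap = abs(moe_forward(gate_a, experts_a, x) - moe_forward(gate_p, experts_p, x))

# Model B: relabelled copy of A plus small noise, standing in for a second run.
gate_b = [[v + rng.gauss(0, 1e-3) for v in row] for row in gate_p]
experts_b = [[v + rng.gauss(0, 1e-3) for v in row] for row in experts_p]

# Align B to A, then take the midpoint of the linear path between them.
perm = match(gate_a, experts_a, gate_b, experts_b)
gate_b_aligned, experts_b_aligned = permute(gate_b, experts_b, perm)
gate_mid = interpolate(gate_a, gate_b_aligned, 0.5)
experts_mid = interpolate(experts_a, experts_b_aligned, 0.5)
print(perm, gap)
```

Brute force costs factorial time in the number of experts, so it only suits a handful of experts; a practical matcher would more plausibly solve a linear assignment problem over the same cost. Checking LMC would then mean evaluating the loss at several values of `t` along the interpolated path, which this sketch leaves out.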
Supplementary Material: zip
Primary Area: Theory (e.g., control theory, learning theory, algorithmic game theory)
Submission Number: 9016