Keywords: Mixture-of-Experts, Language models
Abstract: Mixture-of-Experts (MoE) is a foundational architecture in modern large language
models (LLMs). However, a structural limitation has been overlooked: the router
is external to the experts, rendering it unaware of their internal capabilities. This
gap between routing decisions and expert capabilities limits model performance.
In this paper, we demonstrate that the activations of a small subset of “routing neurons” within each routed expert’s own parameters can faithfully capture the match
between the expert’s capabilities and input tokens. Collectively, these distributed
routing neurons within each routed expert compose an implicit, capabilities-aware
“router”, in which the norm of an expert’s routing-neuron activations indicates that expert’s weight. A straightforward implementation of this design requires
activating all experts to compute these routing signals, and the routing activations of unselected experts are then discarded. To avoid the computational waste from
activating unselected experts, we introduce another novel design: we unify the
routing neurons of all routed experts to form a virtual shared expert, replacing the
standard shared expert in MoE. In this virtual shared expert, activations are not
wasted, as they not only serve for routing but also contribute to the final outputs of
both the shared expert and part of the routed experts. We name this new MoE variant
Union-of-Experts (UoE), drawing an analogy in which each routing neuron acts as its
expert’s representative and the virtual shared expert is their union, enabling the
experts’ autonomous selection and joint statement. We pre-train language models
ranging from 1B to 3B parameters, showing that UoE consistently outperforms
strong MoE baselines with comparable efficiency.
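To make the routing-neuron idea concrete, below is a minimal sketch, assuming single-hidden-layer ReLU experts whose first few hidden units act as routing neurons. The class name UnionOfExpertsSketch, all dimensions, and the softmax over activation norms are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of routing via "routing neurons": each expert's first `n_routing`
# hidden units form a virtual shared expert whose activation norms gate the
# routed experts, and whose activations are reused in the output.
import torch
import torch.nn.functional as F


class UnionOfExpertsSketch(torch.nn.Module):
    def __init__(self, d_model=64, d_hidden=128, n_experts=8, n_routing=8, top_k=2):
        super().__init__()
        self.n_experts, self.n_routing, self.top_k = n_experts, n_routing, top_k
        # Each expert: W_in (d_model -> d_hidden), W_out (d_hidden -> d_model).
        self.w_in = torch.nn.Parameter(torch.randn(n_experts, d_model, d_hidden) * 0.02)
        self.w_out = torch.nn.Parameter(torch.randn(n_experts, d_hidden, d_model) * 0.02)

    def forward(self, x):  # x: (batch, d_model)
        # Virtual shared expert: run only the routing-neuron slice of every expert.
        route_act = F.relu(
            torch.einsum("bd,edh->beh", x, self.w_in[:, :, : self.n_routing])
        )  # (batch, n_experts, n_routing)

        # Expert weight = norm of its routing-neuron activations
        # (assumption: softmax-normalized across experts).
        scores = route_act.norm(dim=-1)                  # (batch, n_experts)
        gates = F.softmax(scores, dim=-1)
        top_w, top_idx = gates.topk(self.top_k, dim=-1)  # (batch, top_k)

        # Shared-expert contribution: routing activations are reused, not discarded.
        out = torch.einsum("beh,ehd->bd", route_act, self.w_out[:, : self.n_routing, :])

        # Routed experts: compute only the remaining hidden units of the selected experts.
        for k in range(self.top_k):
            idx = top_idx[:, k]                                    # (batch,)
            w_in_rest = self.w_in[idx][:, :, self.n_routing :]     # (batch, d_model, d_hidden - r)
            w_out_rest = self.w_out[idx][:, self.n_routing :, :]   # (batch, d_hidden - r, d_model)
            h_rest = F.relu(torch.einsum("bd,bdh->bh", x, w_in_rest))
            out = out + top_w[:, k : k + 1] * torch.einsum("bh,bhd->bd", h_rest, w_out_rest)
        return out


# Usage: one forward pass on random tokens.
if __name__ == "__main__":
    layer = UnionOfExpertsSketch()
    y = layer(torch.randn(4, 64))
    print(y.shape)  # torch.Size([4, 64])
```

In this sketch no activation is wasted: the routing-neuron slice is computed once for all experts, serves as the gating signal, and feeds the shared-expert output, while only the selected experts evaluate their remaining hidden units.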
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20138