Union-of-Experts: Experts in Mixture-of-Experts are Secretly Routers

ICLR 2026 Conference Submission 20138 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Mixture-of-Experts, Language models
Abstract: Mixture-of-Experts (MoE) is a foundational architecture in modern large language models (LLMs). However, a structural limitation has been overlooked: the router is external to the experts, rendering it unaware of their internal capabilities. This gap between routing decisions and expert capabilities limits model performance. In this paper, we demonstrate that the activations of a small subset of “routing neurons” within each routed expert’s own parameters can faithfully capture the match between the expert’s capabilities and input tokens. Collectively, these routing neurons distributed across the routed experts compose an implicit, capability-aware “router”, where the norm of each expert’s routing-neuron activations indicates that expert’s weight. A straightforward implementation of this design requires activating all experts to compute these routing signals, after which the unselected experts’ routing-neuron activations are discarded. To avoid the computational waste of activating unselected experts, we introduce another novel design: we unify the routing neurons of all routed experts into a virtual shared expert, replacing the standard shared expert in MoE. In this virtual shared expert, no activation is wasted: each activation serves not only as a routing signal but also as a contribution to the final outputs of both the shared expert and part of the routed experts. We name this new MoE variant Union-of-Experts (UoE), drawing an analogy in which the routing neurons act as each expert’s representatives and the virtual shared expert is their union, enabling autonomous expert selection and a joint contribution to the output. We pre-train language models ranging from 1B to 3B parameters, showing that UoE consistently outperforms strong MoE baselines with comparable efficiency.
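The abstract describes the mechanism only at a high level. Below is a minimal PyTorch sketch of how routing neurons and the virtual shared expert could fit together, assuming each expert is a two-layer MLP whose first `r` hidden units serve as routing neurons; the module names, GELU/softmax/top-k choices, and weighting scheme are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UoELayer(nn.Module):
    """Sketch of a Union-of-Experts layer as described in the abstract.

    Assumptions (hypothetical, not the paper's code): the first `r` hidden
    units of each expert are its "routing neurons"; all experts' routing
    neurons together form the virtual shared expert; the L2 norm of each
    expert's routing-neuron activations is its gating score.
    """

    def __init__(self, d_model, d_hidden, n_experts, r, top_k):
        super().__init__()
        self.n_experts, self.top_k = n_experts, top_k
        # Up/down projections split into routing neurons (first r units) and the rest.
        self.up_route = nn.ModuleList([nn.Linear(d_model, r) for _ in range(n_experts)])
        self.down_route = nn.ModuleList([nn.Linear(r, d_model) for _ in range(n_experts)])
        self.up_rest = nn.ModuleList([nn.Linear(d_model, d_hidden - r) for _ in range(n_experts)])
        self.down_rest = nn.ModuleList([nn.Linear(d_hidden - r, d_model) for _ in range(n_experts)])

    def forward(self, x):                                        # x: (tokens, d_model)
        # 1) Virtual shared expert: routing-neuron activations of ALL experts,
        #    always computed and always contributing to the output.
        h_route = torch.stack([F.gelu(up(x)) for up in self.up_route], dim=1)   # (T, E, r)
        shared_out = sum(self.down_route[e](h_route[:, e]) for e in range(self.n_experts))

        # 2) Routing without an external router: the norm of each expert's
        #    routing-neuron activations is that expert's score.
        scores = h_route.norm(dim=-1)                            # (T, E)
        weights = F.softmax(scores, dim=-1)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)        # (T, k)

        # 3) Only selected experts compute their remaining neurons; the
        #    routing-neuron activations are reused, not recomputed.
        out = shared_out.clone()
        for e in range(self.n_experts):
            sel = (top_idx == e)                                 # (T, k) selection mask
            mask = sel.any(dim=-1)                               # tokens routed to expert e
            if not mask.any():
                continue
            w = (top_w * sel).sum(dim=-1)[mask].unsqueeze(-1)    # gate weight per token
            h_rest = F.gelu(self.up_rest[e](x[mask]))            # non-routing neurons only
            out[mask] = out[mask] + w * self.down_rest[e](h_rest)
        return out
```

In this sketch the routing-neuron cost is shared: it produces the gating scores and the shared expert's output in one pass, so no activation computed for routing is thrown away, which is the efficiency argument made in the abstract.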
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20138