Autonomy-of-Experts Models

Published: 01 May 2025 · Last Modified: 23 Jul 2025 · ICML 2025 poster · License: CC BY-NC-ND 4.0
Abstract:

Mixture-of-Experts (MoE) models typically use a router to assign tokens to specific expert modules, activating only a subset of parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and ineffective learning. To address this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked by their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train language models with 700M to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.
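The abstract only sketches the mechanism at a high level. Below is a minimal, hedged PyTorch illustration of how a router-free, AoE-style layer could be organized: each expert's first projection is low-rank factorized, every expert cheaply pre-computes its low-rank activation, experts are ranked per token by the norm of that activation, and only the top-k experts finish the forward pass. The class name `AoELayer`, the SiLU nonlinearity, the softmax mixing of selected experts, and all tensor shapes are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AoELayer(nn.Module):
    """Illustrative sketch of an Autonomy-of-Experts style layer (not the paper's exact design).

    Each expert factorizes its up projection W_up ~= W_a @ W_b (low rank).
    All experts pre-compute the cheap low-rank activation x @ W_a; experts are
    ranked per token by the norm of this internal activation, and only the
    top-k experts complete their forward pass. No router is used.
    """

    def __init__(self, d_model, d_ff, n_experts, low_rank, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Low-rank factorization of each expert's up projection, plus a down projection.
        self.w_a = nn.Parameter(torch.randn(n_experts, d_model, low_rank) * 0.02)
        self.w_b = nn.Parameter(torch.randn(n_experts, low_rank, d_ff) * 0.02)
        self.w_down = nn.Parameter(torch.randn(n_experts, d_ff, d_model) * 0.02)

    def forward(self, x):                                   # x: (tokens, d_model)
        # 1) Cheap pre-computation: low-rank internal activations for every expert.
        pre = torch.einsum("td,edr->etr", x, self.w_a)      # (experts, tokens, rank)
        # 2) Self-evaluation: each expert's score is its activation norm per token.
        scores = pre.norm(dim=-1)                           # (experts, tokens)
        top_val, top_idx = scores.topk(self.top_k, dim=0)   # (k, tokens)
        weights = F.softmax(top_val, dim=0)                 # mix the selected experts
        # 3) Only the top-ranking experts proceed; the rest abort here.
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            e = top_idx[k]                                  # chosen expert id per token
            h = pre[e, torch.arange(x.size(0))]             # reuse the pre-computed activation
            h = torch.einsum("tr,trf->tf", h, self.w_b[e])  # finish the up projection
            y = torch.einsum("tf,tfd->td", F.silu(h), self.w_down[e])
            out += weights[k].unsqueeze(-1) * y
        return out
```

In this sketch, only the (experts × rank)-sized pre-computation is done for all experts; the full d_ff-wide computation runs solely for the selected ones, which is how the low-rank factorization keeps the selection overhead small.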

Lay Summary:

Modern large language models often use a technique called "Mixture-of-Experts" (MoE), where only a portion of the model is activated for each input, saving time and resources. Typically, a separate module called a "router" decides which parts of the model—namely, the experts—to activate. However, the separation between the router and the experts can hinder training, cause imbalanced expert workloads, and lower overall performance. In our work, we introduce a new method called "Autonomy-of-Experts," in which each expert decides for itself whether to activate, based on the scale of its internal activations (i.e., how useful it expects to be). This removes the need for a router and leads to improved performance.

Primary Area: Deep Learning->Large Language Models
Keywords: Mixture-of-Experts, Autonomy-of-Experts, Language models
Submission Number: 2647