DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts

ICLR 2026 Conference Submission 21051 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Mixture of experts; model scaling; structural learning
TL;DR: We analyze the limits of summation mixing in MoE and show that structural mixing enables richer expert interactions and better performance.
Abstract: Mixture-of-Experts (MoE) models have become a leading approach for decoupling parameter count from computational cost in large language models. Despite significant progress, effectively scaling MoE performance remains a challenge. Previous work shows that using fine-grained experts enlarges the space of expert combinations and can improve flexibility, but it also imposes substantial routing overhead, creating a new scalability bottleneck. In this paper, we explore a complementary axis for scaling: how expert outputs are aggregated. We first analyze the limitations of the standard weighted-summation aggregation in conventional MoE architectures. We then show theoretically that introducing structural aggregation expands the expert-combination space without altering the experts or the router configuration, and enables a form of multi-step reasoning within a single MoE layer. To this end, we propose DAG-MoE, a sparse MoE framework that employs a lightweight module to automatically learn the aggregation structure among the selected experts. We evaluate DAG-MoE under standard language-modeling settings. Extensive experiments show that DAG-MoE consistently improves performance in both pretraining and fine-tuning, surpassing traditional MoE baselines.
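
To make the contrast concrete, below is a minimal PyTorch sketch, not the authors' implementation, of standard weighted-sum aggregation versus one possible form of DAG-structured aggregation over the top-k selected experts. The class names (`WeightedSumMoE`, `DAGStyleMoE`), the structure predictor `self.struct`, and the token-conditioned lower-triangular adjacency are illustrative assumptions; the abstract only states that a lightweight module learns the aggregation structure among the selected experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeightedSumMoE(nn.Module):
    """Conventional top-k MoE layer: output = sum_i g_i * expert_i(x)."""

    def __init__(self, d_model: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model); one token per row for simplicity.
        gate_logits = self.router(x)                       # (B, E)
        gates, idx = gate_logits.topk(self.top_k, dim=-1)  # (B, k)
        gates = F.softmax(gates, dim=-1)

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    # Flat mixing: each expert sees only x, outputs are summed.
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


class DAGStyleMoE(WeightedSumMoE):
    """Sketch of structural aggregation (hypothetical form): a lightweight
    module predicts a strictly lower-triangular adjacency over the k selected
    slots, so later experts also consume earlier experts' outputs before the
    gated read-out."""

    def __init__(self, d_model: int, n_experts: int, top_k: int = 2):
        super().__init__(d_model, n_experts, top_k)
        # Assumed structure predictor: one score per ordered slot pair.
        self.struct = nn.Linear(d_model, top_k * top_k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate_logits = self.router(x)
        gates, idx = gate_logits.topk(self.top_k, dim=-1)
        gates = F.softmax(gates, dim=-1)

        # Token-conditioned edge weights; keeping only the strict lower
        # triangle makes the slot-to-slot graph acyclic (a DAG).
        adj = torch.sigmoid(self.struct(x)).view(-1, self.top_k, self.top_k)
        adj = adj.tril(diagonal=-1)                        # (B, k, k)

        slot_outputs = []                                  # each (B, d_model)
        for j in range(self.top_k):
            inp = x
            if j > 0:
                # Slot j receives the token plus DAG-weighted earlier outputs.
                prev = torch.stack(slot_outputs, dim=1)    # (B, j, d_model)
                inp = x + (adj[:, j, :j].unsqueeze(-1) * prev).sum(dim=1)
            out_j = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                mask = idx[:, j] == e
                if mask.any():
                    out_j[mask] = expert(inp[mask])
            slot_outputs.append(out_j)

        stacked = torch.stack(slot_outputs, dim=1)         # (B, k, d_model)
        return (gates.unsqueeze(-1) * stacked).sum(dim=1)  # gated read-out


# Example usage: both layers keep the same experts and router configuration.
layer = DAGStyleMoE(d_model=64, n_experts=8, top_k=2)
y = layer(torch.randn(4, 64))                              # -> (4, 64)
```

Restricting the learned adjacency to the strict lower triangle is one simple way to guarantee acyclicity; whatever mechanism DAG-MoE actually uses, the point of the sketch is that selected experts can feed one another within a single layer instead of being combined only by a flat weighted sum.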
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21051