Yuan 2.0-M32: A MoE model with Localized Filtering-based Attention and Attention Router

ACL ARR 2025 May Submission1482 Authors

17 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: In this work, we develop and release Yuan 2.0-M32, a language model that uses a mixture-of-experts architecture with 32 experts, of which 2 are active. Localized Filtering-based Attention (LFA) is introduced into the base architecture to incorporate prior knowledge of the local dependencies of natural language into attention. A new router network, the Attention Router, is adopted for more efficient selection of experts and improves accuracy compared to a model with a classical router network. A distributed training method combining non-uniform pipeline parallelism, data parallelism, and optimizer parallelism is proposed, which greatly reduces the bandwidth requirements of intra-node communication and achieves good performance in large-scale distributed training. Yuan 2.0-M32 is trained from scratch on 2000B tokens, and its training computation consumption is only 9.25% of that of a dense model at the same parameter scale.
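To make the routing description above concrete, here is a minimal sketch of a top-2 mixture-of-experts layer whose router scores experts with an attention-style computation over the expert dimension instead of independent per-expert dot products. The layer sizes, the exact attention formulation, and the expert networks are placeholder assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: shapes, the attention form in the router, and the
# expert MLPs are assumptions, not the Yuan 2.0-M32 implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionRouter(nn.Module):
    """Scores experts while modelling correlations among them via attention."""

    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        # Three projections give query/key/value views over the expert dimension.
        self.wq = nn.Linear(d_model, n_experts, bias=False)
        self.wk = nn.Linear(d_model, n_experts, bias=False)
        self.wv = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model) -> per-expert routing scores (tokens, n_experts)
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        # Attention over experts captures inter-expert correlation, unlike a
        # classical router that scores each expert independently.
        attn = torch.softmax(q.unsqueeze(-1) * k.unsqueeze(-2), dim=-1)  # (tokens, E, E)
        return torch.einsum("tef,tf->te", attn, v)


class Top2MoE(nn.Module):
    def __init__(self, d_model: int = 64, d_ff: int = 128, n_experts: int = 32, top_k: int = 2):
        super().__init__()
        self.router = AttentionRouter(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.router(x)                              # (tokens, n_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)  # keep 2 of 32 experts per token
        gates = F.softmax(top_vals, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    tokens = torch.randn(4, 64)
    print(Top2MoE()(tokens).shape)  # torch.Size([4, 64])
```

With 2 of 32 experts active per token, only a small fraction of the expert parameters participate in each forward pass, which is the source of the reduced training computation relative to a dense model of the same parameter scale.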
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Efficient/Low-Resource Methods for NLP, Generation, Information Retrieval and Text Mining, NLP Applications
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English, Chinese
Submission Number: 1482