Track: Main paper track (up to 5 pages excluding references and appendix)
Keywords: Linear Sequence Modeling, Mixture-of-Experts, Distributed Training
Abstract: Linear Sequence Modeling (LSM) and Mixture-of-Experts (MoE) have recently emerged as effective architectural improvements. In this paper, we introduce Linear-MoE, a production-level system for modeling and training large-scale models that integrate LSM with MoE. Linear-MoE leverages the advantages of both LSM modules for linear-complexity sequence modeling and MoE layers for sparse activation, aiming to offer high performance with efficient training and deployment. The Linear-MoE system comprises two primary subsystems: Modeling and Training. The Modeling subsystem provides a unified framework supporting multiple types of LSM methods, including linear attention, SSM, and linear RNN. The Training subsystem facilitates efficient training by incorporating advanced parallelism techniques such as Tensor, Pipeline, and Expert Parallelism, along with LASP-based Sequence Parallelism for handling very long input sequences. The system is designed to be extensible, allowing additional sequence modeling and training capabilities to be integrated in the future. Additionally, we explore hybrid Linear-MoE models that combine Linear-MoE layers with standard Transformer-MoE layers to further enhance model flexibility and performance. Experimental evaluations on two model series, A0.3B-2B and A1B-7B, demonstrate that Linear-MoE achieves efficiency gains while maintaining competitive performance on various benchmarks. The code is released at: \url{https://github.com/OpenSparseLLMs/Linear-MoE}.
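For intuition only, the sketch below illustrates the kind of block the abstract describes: a linear-complexity token mixer (here, a generic kernel-based linear attention) followed by a sparsely activated MoE feed-forward layer. This is a minimal PyTorch-style sketch under our own assumptions, not the released implementation; the class names (LinearAttention, MoE, LinearMoEBlock) and hyperparameters are hypothetical, and for clarity every expert runs densely here, whereas a real system dispatches only routed tokens.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttention(nn.Module):
    """Kernel-based linear attention: O(n) in sequence length."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                      # x: (batch, seq, dim)
        b, n, d = x.shape
        h = self.heads
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, h, d // h).transpose(1, 2) for t in (q, k, v))
        q, k = F.elu(q) + 1, F.elu(k) + 1      # positive feature map
        kv = torch.einsum("bhnd,bhne->bhde", k, v)            # accumulate over sequence
        z = 1 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
        return self.out(out.transpose(1, 2).reshape(b, n, d))

class MoE(nn.Module):
    """MoE feed-forward layer with top-k routing (dense compute for clarity)."""
    def __init__(self, dim, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (batch, seq, dim)
        scores = self.router(x).softmax(dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                  # tokens routed to expert e
            if mask.any():
                w = (weights * mask).sum(dim=-1, keepdim=True)
                out = out + w * expert(x)
        return out

class LinearMoEBlock(nn.Module):
    """One block: linear-attention token mixer followed by an MoE channel mixer."""
    def __init__(self, dim):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mixer = LinearAttention(dim)
        self.moe = MoE(dim)

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))
        return x + self.moe(self.norm2(x))

A hybrid model, as mentioned in the abstract, would interleave such blocks with standard softmax-attention Transformer-MoE blocks; the distributed-training aspects (Tensor/Pipeline/Expert/Sequence Parallelism) are orthogonal to this single-device sketch.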
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 98