Keywords: MoE, Low-cost
Abstract: Training large-scale Mixture-of-Experts (MoE) models typically requires high-memory, high-bandwidth GPUs (e.g., A100), whose high cost has become a major barrier to large-model training. In contrast, affordable hardware such as DCUs costs less than \$0.03 per hour but is limited in memory capacity and bandwidth, making it unsuitable for direct LLM training. To address this, we propose MoE-DisCo (Mixture-of-Experts with Disentangled Clustering and Coordination), a staged training framework. MoE-DisCo decomposes the MoE model into multiple dense submodels, each consisting of a shared backbone and a single expert, and partitions the training data into subsets via unsupervised clustering. Each submodel is trained independently and in parallel on its assigned data subset using low-cost devices, with no inter-device communication. All experts are then integrated into a complete MoE model and fine-tuned globally for a short period on high-memory, high-bandwidth GPUs. Experiments show that our method matches or even surpasses end-to-end training across multiple downstream tasks while reducing the usage time of expensive GPUs by over 70\%.
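To make the staged pipeline in the abstract concrete, the following is a minimal, self-contained sketch (PyTorch plus scikit-learn's KMeans) of the three stages: cluster the data, train one dense submodel (backbone copy plus a single expert) per cluster independently, then assemble the trained experts into a gated MoE and fine-tune it briefly. The toy dimensions, module and variable names (ToyMoE, NUM_EXPERTS, the regression objective, the softmax gate) are illustrative assumptions made for this sketch and do not reflect the authors' actual architecture or training setup.

# Sketch of the MoE-DisCo staged pipeline described in the abstract.
# All sizes, names, and the merge strategy below are assumptions of this sketch.
import copy
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

NUM_EXPERTS, DIM, N = 4, 32, 512            # assumed toy sizes
X, Y = torch.randn(N, DIM), torch.randn(N, 1)

# Stage 1a: unsupervised clustering partitions the data, one subset per expert.
cluster_ids = KMeans(n_clusters=NUM_EXPERTS, n_init=10).fit_predict(X.numpy())

backbone_init = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU())   # shared backbone

# Stage 1b: each dense submodel (backbone copy + one expert) trains independently
# on its own cluster, e.g. on a separate low-cost device, with no communication.
sub_backbones, experts = [], []
for e in range(NUM_EXPERTS):
    bb, expert = copy.deepcopy(backbone_init), nn.Linear(DIM, 1)
    opt = torch.optim.Adam(list(bb.parameters()) + list(expert.parameters()), lr=1e-3)
    mask = torch.as_tensor(cluster_ids == e)
    xs, ys = X[mask], Y[mask]
    for _ in range(50):
        opt.zero_grad()
        nn.functional.mse_loss(expert(bb(xs)), ys).backward()
        opt.step()
    sub_backbones.append(bb)
    experts.append(expert)

# Stage 2: integrate all experts into one MoE with a learned gate and fine-tune
# globally for a short period (the only stage that needs the large GPU).
class ToyMoE(nn.Module):
    def __init__(self, backbone, experts):
        super().__init__()
        self.backbone = backbone
        self.experts = nn.ModuleList(experts)
        self.gate = nn.Linear(DIM, len(experts))

    def forward(self, x):
        h = self.backbone(x)
        w = torch.softmax(self.gate(h), dim=-1)                     # (N, E) gate weights
        out = torch.stack([ex(h) for ex in self.experts], dim=-1)   # (N, 1, E)
        return (out * w.unsqueeze(1)).sum(dim=-1)                   # (N, 1)

# The abstract does not specify how the per-submodel backbones are reconciled;
# this sketch simply reuses the first trained backbone (an assumption).
moe = ToyMoE(sub_backbones[0], experts)
opt = torch.optim.Adam(moe.parameters(), lr=1e-4)
for _ in range(20):                          # short global fine-tuning stage
    opt.zero_grad()
    nn.functional.mse_loss(moe(X), Y).backward()
    opt.step()

In the setting described by the abstract, each Stage 1 loop would run on a separate low-cost device with no inter-device communication, and only the short Stage 2 fine-tuning would occupy the high-memory, high-bandwidth GPUs.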
Paper Type: Long
Research Area: Low-resource Methods for NLP
Research Area Keywords: MoE, Low-cost, Training
Contribution Types: Approaches to low-resource settings, Approaches for low compute settings-efficiency, Theory
Languages Studied: English
Submission Number: 2655