From Compression to Specialization: An Information-Preserving Approach for Dense to Mixture-of-Experts Construction

ICLR 2026 Conference Submission 15131 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Mixture-of-Experts, Dense-to-Sparse Conversion, Data-Driven Model Compression, Expert Construction, Parameter-Efficient Training
Abstract: The high cost of training Mixture-of-Experts (MoE) models from scratch has spurred interest in converting pre-trained dense models into sparse MoE models. However, existing dense-to-sparse MoE methods are constrained by a fundamental trade-off between initial expert diversity and knowledge inheritance, often requiring extensive post-training to be effective. We address this by proposing a new expert construction paradigm that repurposes data-driven model compression, and validate that low-rank factorization is uniquely effective at balancing this trade-off. Based on this insight, we introduce MIDAS, a framework that crafts specialized experts by applying low-rank factorization to a base model, guided by distinct calibration datasets. Under limited compute budgets, MIDAS significantly outperforms existing dense-to-sparse approaches through a parameter-efficient strategy that trains only its gating network and low-rank adapters. Crucially, we demonstrate that MIDAS improves model stability by mitigating the severe load imbalance found in prior work, while also producing experts with clear, interpretable specializations that align with established Transformer functional theory. Overall, MIDAS presents a robust and efficient pathway for MoE construction, addressing the diversity-knowledge trade-off through an information-preserving approach.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15131
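
The abstract describes two mechanisms: building each expert by low-rank factorization of the base model's weights under a distinct calibration dataset, and then training only the gating network and low-rank adapters. The sketch below is a hypothetical illustration of that pipeline, not the authors' implementation: it assumes an activation-aware truncated SVD as the factorization criterion and standard LoRA-style adapters, and all names (lowrank_expert_from_dense, LowRankExpert, MoELayer) are illustrative.

```python
# Hypothetical sketch of the dense-to-MoE construction described in the abstract.
# Assumptions: activation-aware truncated SVD as the low-rank factorization,
# LoRA-style adapters, top-k softmax routing. MIDAS's actual criteria may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


def lowrank_expert_from_dense(weight: torch.Tensor,
                              calib_inputs: torch.Tensor,
                              rank: int):
    """Factor a dense weight W (out x in) into U_r @ V_r of the given rank.

    Calibration inputs whiten the input space so the truncated SVD keeps the
    directions that matter for that calibration set (an assumption; the exact
    data-driven criterion used by MIDAS is not specified in the abstract).
    """
    # Empirical second moment of calibration activations (in x in).
    cov = calib_inputs.T @ calib_inputs / calib_inputs.shape[0]
    # Cholesky factor S with cov = S S^T; factorize W S, then undo the whitening.
    S = torch.linalg.cholesky(cov + 1e-4 * torch.eye(cov.shape[0]))
    U, sigma, Vh = torch.linalg.svd(weight @ S, full_matrices=False)
    U_r = U[:, :rank] * sigma[:rank]                 # (out x rank)
    V_r = torch.linalg.solve(S.T, Vh[:rank].T).T     # (rank x in), undoes S
    return U_r, V_r                                  # W ≈ U_r @ V_r


class LowRankExpert(nn.Module):
    """Frozen low-rank core plus a small trainable LoRA adapter."""
    def __init__(self, U: torch.Tensor, V: torch.Tensor, lora_rank: int = 8):
        super().__init__()
        out_dim, in_dim = U.shape[0], V.shape[1]
        self.U = nn.Parameter(U, requires_grad=False)   # frozen expert core
        self.V = nn.Parameter(V, requires_grad=False)   # frozen expert core
        self.lora_a = nn.Parameter(torch.zeros(lora_rank, in_dim))
        self.lora_b = nn.Parameter(torch.randn(out_dim, lora_rank) * 0.01)

    def forward(self, x):
        core = x @ self.V.T @ self.U.T                  # low-rank projection
        delta = x @ self.lora_a.T @ self.lora_b.T       # trainable correction
        return core + delta


class MoELayer(nn.Module):
    """Top-k routing; only the gate and LoRA parameters require gradients."""
    def __init__(self, experts, hidden_dim, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        self.gate = nn.Linear(hidden_dim, len(experts))  # trainable router
        self.top_k = top_k

    def forward(self, x):                                # x: (tokens, hidden)
        scores = F.softmax(self.gate(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros(x.shape[0], self.experts[0].U.shape[0],
                          device=x.device, dtype=x.dtype)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Under these assumptions, each expert would be built by calling lowrank_expert_from_dense on the same dense weight with a different calibration set, so the experts inherit the base model's knowledge while diverging toward their calibration data; freezing U and V (and the rest of the backbone) while leaving only the gate and LoRA matrices trainable mirrors the parameter-efficient training strategy the abstract describes.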