ToMoE: Converting Dense Large Language Models to Mixture-of-Experts through Dynamic Structural Pruning

TMLR Paper 5955 Authors

21 Sept 2025 (modified: 14 Nov 2025) · Under review for TMLR · CC BY 4.0
Abstract: Large Language Models (LLMs) demonstrate remarkable capabilities but face deployment challenges due to their high computational demands. Traditional pruning methods reduce these costs by permanently removing parameters, which inevitably leads to performance degradation. To mitigate this issue, we propose ToMoE, a method that transforms dense LLMs into Mixture-of-Experts (MoE) models by uncovering experts inherently present within dense models, without requiring any weight updates. ToMoE leverages dynamic structural pruning to unify expert construction and router training in a single stage, achieving consistently strong performance. Remarkably, even without fine-tuning the model weights, ToMoE consistently outperforms state-of-the-art pruning and MoE techniques across Phi-2, LLaMA-2, LLaMA-3, and Qwen-2.5 models.
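To make the dense-to-MoE conversion described in the abstract concrete, the toy sketch below partitions a dense FFN's hidden neurons into expert slices and routes each token to one slice. This is an illustrative placeholder, not ToMoE's actual construction: the module name, the even neuron partition, and the top-1 router are assumptions, and the full FFN is still computed and then masked, so it shows the structure rather than the compute savings of a real MoE.

```python
# Illustrative sketch (not the paper's algorithm): view a dense FFN's hidden
# neurons as a union of "experts" and route each token to its top-1 slice,
# so only part of the original (frozen) weights contributes per token.
import torch
import torch.nn as nn


class DenseFFNAsMoE(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, num_experts=4):
        super().__init__()
        assert d_hidden % num_experts == 0
        self.slice = d_hidden // num_experts
        # Original dense FFN weights, kept frozen (no weight updates).
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)
        self.act = nn.GELU()
        # Lightweight router that scores experts for each token.
        self.router = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x):  # x: (batch, seq, d_model)
        expert = self.router(x).argmax(dim=-1)            # (batch, seq)
        h = self.act(self.up(x))                          # (batch, seq, d_hidden)
        # Zero out hidden units belonging to unselected expert slices.
        idx = torch.arange(h.size(-1), device=x.device) // self.slice
        mask = (idx == expert.unsqueeze(-1)).to(h.dtype)  # (batch, seq, d_hidden)
        return self.down(h * mask)
```

A practical system would gather only the selected weight slices instead of masking, and would train the router with a differentiable objective; both details are beyond this sketch.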
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yu_Cheng1
Submission Number: 5955