Keywords: Large Language Models, Mixture of Experts, Upcycling, Model Compression
TL;DR: Dense2MoE unifies layer pruning and upcycling to push the efficiency-accuracy Pareto frontier by routing tokens through a learned subset of retained MLPs organized as MoEs.
Abstract: The Mixture of Experts (MoE) architecture has become a mainstream design in Large Language Models (LLMs) for its ability to flexibly scale parameters while maintaining inference efficiency. However, training MoE models from scratch remains prohibitively expensive due to their high computational demands. Existing upcycling methods reduce costs by converting dense LLMs into MoEs through layer duplication and fine-tuning, but they introduce substantial redundancy. While layer pruning can reduce such redundancy, it often leads to notable performance degradation. We propose Dense2MoE, a novel approach that unifies layer pruning and upcycling. Our method prunes highly redundant layers of an LLM while retaining their MLPs as experts in an MoE, so that each token is routed through a subset of the retained MLPs rather than all of them. This design leverages open-source LLMs with low additional computational overhead, improving model performance while reducing the number of active parameters. Extensive experiments show that Dense2MoE advances the efficiency-accuracy Pareto frontier relative to the original seed models and achieves a superior trade-off between efficiency and effectiveness compared with alternative approaches.
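To make the core idea concrete, the following minimal PyTorch sketch shows one way the retained MLPs of pruned layers could be reused as experts behind a learned top-k router, so each token activates only a subset of them. This is an illustrative sketch under our own assumptions, not the authors' implementation; the class name `UpcycledMoE`, the `retained_mlps` argument, and `top_k` are hypothetical.

```python
# Conceptual sketch (not the authors' implementation): reuse the MLPs kept from
# pruned transformer layers as experts in a single MoE block, so each token
# passes through only top_k of the retained MLPs instead of all of them.
# All module and parameter names here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UpcycledMoE(nn.Module):
    def __init__(self, hidden_size: int, retained_mlps: list[nn.Module], top_k: int = 2):
        super().__init__()
        # Experts are the MLP sub-modules retained from the pruned dense layers.
        self.experts = nn.ModuleList(retained_mlps)
        # A lightweight learned router scores each token against every expert.
        self.router = nn.Linear(hidden_size, len(retained_mlps), bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_size)
        logits = self.router(x)                              # (B, S, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)   # route to a subset
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = indices[..., slot]                         # (B, S) chosen expert ids
            w = weights[..., slot].unsqueeze(-1)             # (B, S, 1) routing weights
            for e, expert in enumerate(self.experts):
                mask = (idx == e)
                if mask.any():
                    # Only tokens routed to this expert are processed by it.
                    out[mask] += w[mask] * expert(x[mask])
        return out
```

Because only `top_k` experts are active per token, this sketch routes each token through a small subset of the retained MLPs rather than all of them, which is the mechanism the abstract credits for reducing active parameters while preserving the capacity of the seed model.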
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 10936