Keywords: Mixture-of-Experts, Finer-Grained Expert, Upcycling
Abstract: Fine-grained expert design has hitherto been restricted to the intermediate dimension of MoE layers, while its potential at the output dimension remains largely unexplored, primarily because of the dimension discrepancy it would introduce in the computations following the MoE layers. Drawing inspiration from multi-head attention, we pioneer the FineRMoE (FineR-grained MoE) architecture, which extends fine-grained expert design to both the intermediate and output dimensions to further enhance expert specialization. FineRMoE introduces a bi-level sparsity paradigm: a sparse sum layer produces dimension-reduced candidate vectors for each token through its activated experts, and a sparse concatenation layer then reassembles a dimension-restored output by selectively concatenating the chosen candidate vectors. Despite this bi-level sparsity, we devise a specialized routing mechanism in which a single router network governs both expert activation and candidate selection, eliminating the extra computational cost of maintaining two distinct routers. Meanwhile, to avoid the prohibitive cost of training FineRMoE from scratch, we adopt the upcycling paradigm for efficient expert construction and training. However, existing upcycling methods are tailored to single-layer, additive-fusion MoE architectures and are therefore not applicable to FineRMoE. We thus propose an upcycling method, compatible with prevailing ones, that builds FineRMoE in a cost-effective manner. By enabling flexible partitioning and expansion of pre-trained FFNs along both the intermediate and output dimensions, this upcycling method offers broad adaptability in converting dense models into MoE models. Experimentally, we build FineRMoE models, in which 2 of 128 experts are sparsely activated, from Qwen2.5 at the 0.5B, 1.5B, and 7B scales via the proposed upcycling method. After continued training on 50B tokens, FineRMoE outperforms baselines across ten standard benchmarks while exhibiting remarkable efficiency in both parameters and inference. Extensive experiments validate the effectiveness of both the FineRMoE architecture and the upcycling method.
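The abstract does not give implementation details, but a minimal sketch of the bi-level sparse layer it describes might look as follows. All names and dimensions here (FineRMoELayerSketch, expert widths of d_ff // top_k, candidate size d_model // top_k, and reusing softmaxed router scores to weight candidates) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FineRMoELayerSketch(nn.Module):
    """Illustrative sketch only: activated experts emit dimension-reduced
    candidates (sparse sum stage), and the chosen candidates are concatenated
    back to the model dimension (sparse concatenation stage), with a single
    router governing both stages. Expert widths and the weighting scheme are
    assumptions, not the paper's."""

    def __init__(self, d_model: int = 512, d_ff: int = 1024,
                 num_experts: int = 128, top_k: int = 2):
        super().__init__()
        assert d_model % top_k == 0
        self.top_k = top_k
        d_cand = d_model // top_k  # assumed dimension-reduced candidate size
        self.router = nn.Linear(d_model, num_experts, bias=False)  # single router
        # Fine-grained experts: reduced along both intermediate and output dims.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff // top_k), nn.SiLU(),
                          nn.Linear(d_ff // top_k, d_cand))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = self.router(x).softmax(dim=-1)            # shared routing scores
        top_w, top_idx = scores.topk(self.top_k, dim=-1)   # expert activation + candidate choice
        candidates = []
        for k in range(self.top_k):
            # Sparse sum stage: each activated expert yields a reduced candidate.
            out_k = torch.stack([self.experts[int(e)](x[t])
                                 for t, e in enumerate(top_idx[:, k])])
            candidates.append(top_w[:, k:k + 1] * out_k)   # weight by router score
        # Sparse concatenation stage: reassemble the dimension-restored output.
        return torch.cat(candidates, dim=-1)               # (num_tokens, d_model)
```

The per-token loop is for clarity only; an actual implementation would batch tokens by expert.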
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4036