Keywords: Large Language Models, Mixture of Experts
Abstract: The Mixture of Experts (MoE) architecture is a cornerstone of modern state-of-the-art (SOTA) large language models (LLMs).
MoE models scale efficiently by activating only a sparse subset of parameters per token.
However, the traditional MoE architecture uses homogeneous experts of uniform size, activating a fixed number of parameters regardless of input complexity and thus limiting computational efficiency.
To overcome this limitation, we introduce Grove MoE, a novel architecture incorporating experts of varying sizes, inspired by the heterogeneous big.LITTLE CPU architecture.
This architecture features adjugate experts with a dynamic activation mechanism, expanding model capacity while keeping computational overhead manageable.
Building on this architecture, we present GroveMoE-Base and GroveMoE-Inst, 33B-parameter LLMs developed by applying an upcycling strategy to the Qwen3-30B-A3B-Base model during mid-training and post-training.
GroveMoE models dynamically activate $3.14$–$3.28$B parameters based on token complexity and achieve performance comparable to SOTA open-source models of similar or even larger size.
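To make the idea concrete, below is a minimal PyTorch sketch of one way a heterogeneous MoE layer with conditionally activated adjugate experts could be wired up. This is not the paper's implementation: the class name `GroveLayerSketch`, the one-adjugate-per-expert pairing, and the gate-weight threshold `adjugate_threshold` are all illustrative assumptions; the actual Grove MoE expert grouping and activation rule are defined in the paper itself.

```python
# A hypothetical sketch (not the authors' code) of a heterogeneous MoE
# layer whose activated-parameter count varies per token.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """Feed-forward expert; the hidden width may differ across experts."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.up(x)))


class GroveLayerSketch(nn.Module):
    """Heterogeneous MoE layer: routed experts of varying sizes, each
    shadowed by a small 'adjugate' expert that fires only when the
    router is confident, so activated parameters vary per token.
    (Illustrative assumption; the paper's rule may differ.)"""

    def __init__(self, d_model=64, expert_hidden=(128, 128, 256, 256),
                 adjugate_hidden=32, top_k=2, adjugate_threshold=0.6):
        super().__init__()
        self.experts = nn.ModuleList(Expert(d_model, h) for h in expert_hidden)
        self.adjugates = nn.ModuleList(
            Expert(d_model, adjugate_hidden) for _ in expert_hidden)
        self.router = nn.Linear(d_model, len(expert_hidden))
        self.top_k = top_k
        self.adjugate_threshold = adjugate_threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)
        weights, idx = gates.topk(self.top_k, dim=-1)  # (tokens, top_k)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if not mask.any():
                    continue
                w = weights[mask, slot].unsqueeze(-1)
                out[mask] = out[mask] + w * expert(x[mask])
                # Dynamic activation: run the adjugate expert only for
                # tokens whose gate weight clears a threshold, so "hard"
                # tokens activate more parameters than easy ones.
                strong = mask & (weights[:, slot] > self.adjugate_threshold)
                if strong.any():
                    ws = weights[strong, slot].unsqueeze(-1)
                    out[strong] = out[strong] + ws * self.adjugates[e](x[strong])
        return out


tokens = torch.randn(10, 64)   # 10 tokens, d_model = 64
layer = GroveLayerSketch()
print(layer(tokens).shape)     # torch.Size([10, 64])
```

Under this sketch, the gate-weight threshold is what makes compute input-dependent: easy tokens touch only their routed experts, while confidently routed tokens additionally activate the small adjugate experts, giving a per-token activated-parameter count that floats within a narrow band, analogous to the $3.14$–$3.28$B range reported above.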
Primary Area: foundation or frontier models, including LLMs
Submission Number: 3351