Keywords: Large Language Models, Mixture of Experts
Abstract: The Mixture of Experts (MoE) architecture is a cornerstone of modern state-of-the-art (SOTA) large language models (LLMs).
MoE models scale efficiently by activating only a sparse subset of parameters per token.
However, the traditional MoE architecture uses homogeneous experts of uniform size, activating a fixed number of parameters regardless of input complexity and thus limiting computational efficiency.
To overcome this limitation, we introduce Grove MoE, a novel architecture incorporating experts of varying sizes, inspired by the heterogeneous big.LITTLE CPU architecture.
This architecture features adjugate experts with a dynamic activation mechanism, expanding model capacity while keeping computational overhead manageable.
Building on this architecture, we present GroveMoE-Base and GroveMoE-Inst, 33B-parameter LLMs developed by applying an upcycling strategy to the Qwen3-30B-A3B-Base model during mid-training and post-training.
GroveMoE models dynamically activate $3.14$–$3.28$B parameters based on token complexity and achieve performance comparable to SOTA open-source models of similar or even larger size.
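To make the idea concrete, below is a minimal PyTorch sketch of one way a heterogeneous MoE layer with conditionally activated adjugate experts could be wired up. This is not the paper's implementation: the class name `GroveLayerSketch`, the one-adjugate-per-expert pairing, and the gate-weight threshold `adjugate_threshold` are all illustrative assumptions; the actual Grove MoE expert grouping and activation rule are defined in the paper itself.

```python
# A hypothetical sketch (not the authors' code) of a heterogeneous MoE
# layer whose activated-parameter count varies per token.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """Feed-forward expert; the hidden width may differ across experts."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.up(x)))


class GroveLayerSketch(nn.Module):
    """Heterogeneous MoE layer: routed experts of varying sizes, each
    shadowed by a small 'adjugate' expert that fires only when the
    router is confident, so activated parameters vary per token.
    (Illustrative assumption; the paper's rule may differ.)"""

    def __init__(self, d_model=64, expert_hidden=(128, 128, 256, 256),
                 adjugate_hidden=32, top_k=2, adjugate_threshold=0.6):
        super().__init__()
        self.experts = nn.ModuleList(Expert(d_model, h) for h in expert_hidden)
        self.adjugates = nn.ModuleList(
            Expert(d_model, adjugate_hidden) for _ in expert_hidden)
        self.router = nn.Linear(d_model, len(expert_hidden))
        self.top_k = top_k
        self.adjugate_threshold = adjugate_threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)
        weights, idx = gates.topk(self.top_k, dim=-1)  # (tokens, top_k)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if not mask.any():
                    continue
                w = weights[mask, slot].unsqueeze(-1)
                out[mask] = out[mask] + w * expert(x[mask])
                # Dynamic activation: run the adjugate expert only for
                # tokens whose gate weight clears a threshold, so "hard"
                # tokens activate more parameters than easy ones.
                strong = mask & (weights[:, slot] > self.adjugate_threshold)
                if strong.any():
                    ws = weights[strong, slot].unsqueeze(-1)
                    out[strong] = out[strong] + ws * self.adjugates[e](x[strong])
        return out


tokens = torch.randn(10, 64)   # 10 tokens, d_model = 64
layer = GroveLayerSketch()
print(layer(tokens).shape)     # torch.Size([10, 64])
```

Under this sketch, the gate-weight threshold is what makes compute input-dependent: easy tokens touch only their routed experts, while confidently routed tokens additionally activate the small adjugate experts, giving a per-token activated-parameter count that floats within a narrow band, analogous to the $3.14$–$3.28$B range reported above.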
Primary Area: foundation or frontier models, including LLMs
Submission Number: 3351