Towards Democratizing LLMs: Investigating Multilingual Mixture-of-Experts Models

Published: 22 Sept 2025, Last Modified: 22 Sept 2025, WiML @ NeurIPS 2025, CC BY 4.0
Keywords: multilingual LLMs, Mixture-of-Experts, Routing
Abstract: Large Language Models (LLMs) achieve impressive performance on high-resource languages but often underperform on underrepresented ones due to data imbalances. Mixture-of-Experts (MoE) architectures offer a promising solution by dynamically allocating computation to different sub-networks, potentially enabling more equitable multilingual learning. In this work, we investigate whether language-specialized experts emerge naturally in decoder-only MoE models under continual multilingual pretraining. Building on OLMoE checkpoints, we train on a curated multilingual corpus spanning both high-resource and low-resource languages, with an emphasis on typological and script diversity. Our intrinsic analyses of expert routing patterns reveal an emergent modularity: early layers function as general-purpose experts, while later layers develop strong, language-specific specialization. Importantly, low-resource languages tend to reuse experts associated with their high-resource counterparts, especially when they share scripts or tokenization schemes. This mechanism provides an efficient pathway for knowledge transfer, suggesting that MoEs can implicitly encode cross-lingual structure. These findings highlight conditional computation as a scalable and linguistically adaptive framework for inclusive multilingual modeling.
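Illustrative sketch (not from the submission): one way to probe the routing patterns described in the abstract is to collect per-layer router logits for text in each language, convert them into expert-usage distributions, and compare languages via a divergence and a top-expert overlap score. The code below is a minimal sketch of that analysis; the function names are hypothetical, and the layer/expert/top-k sizes (16 layers, 64 experts, top-8) are assumed to roughly match an OLMoE-style configuration rather than taken from the paper.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon


def expert_usage(router_logits, top_k=8, num_experts=64):
    """Per-layer expert-usage distribution from router logits.

    router_logits: array of shape (num_layers, num_tokens, num_experts).
    Returns an array of shape (num_layers, num_experts) giving the
    frequency with which each expert appears among the top-k picks.
    """
    num_layers = router_logits.shape[0]
    usage = np.zeros((num_layers, num_experts))
    for layer in range(num_layers):
        # Indices of the top-k experts selected for each token in this layer.
        topk = np.argsort(router_logits[layer], axis=-1)[:, -top_k:]
        counts = np.bincount(topk.ravel(), minlength=num_experts)
        usage[layer] = counts / counts.sum()
    return usage


def routing_divergence(usage_a, usage_b):
    """Per-layer Jensen-Shannon divergence between two languages'
    expert-usage distributions; higher values suggest more
    language-specific routing at that layer."""
    return np.array([jensenshannon(a, b) ** 2 for a, b in zip(usage_a, usage_b)])


def expert_overlap(usage_a, usage_b, top_n=8):
    """Fraction of each layer's top-n most-used experts shared by two
    languages (e.g., a low-resource language and its high-resource,
    same-script counterpart)."""
    overlaps = []
    for a, b in zip(usage_a, usage_b):
        top_a = set(np.argsort(a)[-top_n:])
        top_b = set(np.argsort(b)[-top_n:])
        overlaps.append(len(top_a & top_b) / top_n)
    return np.array(overlaps)


if __name__ == "__main__":
    # Toy stand-in for router logits that would normally be collected
    # from forward passes over text in each language (many MoE models
    # in the Hugging Face Transformers library can return these, e.g.
    # via output_router_logits=True).
    rng = np.random.default_rng(0)
    logits_hi_res = rng.normal(size=(16, 512, 64))  # high-resource language
    logits_lo_res = rng.normal(size=(16, 512, 64))  # low-resource language
    usage_hi = expert_usage(logits_hi_res)
    usage_lo = expert_usage(logits_lo_res)
    print("JSD by layer:", routing_divergence(usage_hi, usage_lo).round(3))
    print("Top-expert overlap by layer:", expert_overlap(usage_hi, usage_lo).round(2))
```

Under this kind of analysis, the trends reported in the abstract would appear as low divergence and high overlap in early layers (general-purpose experts) and higher divergence in later layers (language-specific specialization), with overlap remaining high between languages that share scripts or tokenization.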
Submission Number: 167