MoEsturizer: Resource-Efficient MoE Upcycling for Small Language Models

ICLR 2026 Conference Submission 24957 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Mixture-of-Experts (MoE), Model upcycling, Small language models (SLMs), Resource-constrained training
TL;DR: With 150k samples and one 96GB GPU, upcycling small LMs into sparse MoEs (4 or 8 experts, top-2 routing) beats the dense base models on 9 benchmarks and rivals larger model tiers with far fewer active parameters; depth scaling or higher top-k adds little.
Abstract: Large language models (LLMs) are typically scaled to billions of parameters and trained on trillions of tokens, restricting progress largely to organizations with substantial resources. Recent work on Mixture-of-Experts (MoE) upcycling shows that dense pretrained models can be transformed into sparse MoE variants, but prior studies have focused on large models and required extensive additional training. In this work, we demonstrate that MoE upcycling is also effective for small language models (sub-billion parameters) using only a few hundred thousand samples of supervised fine-tuning. Remarkably, upcycled models consistently outperform their dense base models and remain competitive with dense counterparts of equivalent total size, despite activating fewer parameters at inference. Our study highlights MoE upcycling as a lightweight and practical scaling strategy, while providing empirical insights into its efficiency and limitations. These results establish MoE upcycling as a reproducible pathway for enhancing small models under realistic resource budgets, broadening access to language model improvement.
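To make the upcycling recipe described in the abstract concrete, the sketch below shows one common way a dense FFN can be turned into a sparse MoE layer: every expert is initialized as a copy of the pretrained FFN, and a learned router activates the top-k experts per token. This is an illustrative PyTorch approximation under assumed names (`UpcycledMoE`, `n_experts`, `top_k`), not the authors' implementation.

```python
# Minimal, hypothetical sketch of MoE upcycling (names are illustrative).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpcycledMoE(nn.Module):
    def __init__(self, dense_ffn: nn.Module, d_model: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        # Initialize every expert as a copy of the pretrained dense FFN.
        self.experts = nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts, bias=False)  # token-level gating
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten to tokens for routing.
        tokens = x.reshape(-1, x.size(-1))
        gate_logits = self.router(tokens)                     # (T, n_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)   # keep top-k experts per token
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Gather the tokens (and routing slots) that selected expert e.
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])
        return out.reshape_as(x)
```

Because the experts start as exact copies of the dense FFN, the upcycled model initially reproduces the dense model's behavior (up to the router), and the brief supervised fine-tuning stage is what differentiates the experts.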
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24957