Keywords: mixture-of-experts, sample efficiency
TL;DR: We introduce MoEP (Modular Expert Paths), which adds sparsity while keeping the total parameter count fixed. MoEP combines model parallelism with MoE-style linear projections to implement selective token activation.
Abstract: The transition from dense to sparse model architectures has become a key trend in Large Language Models (LLMs). Methods like Mixture-of-Experts (MoE) let language models scale their representational power without a proportional increase in computation, by activating only a sparse subset of parameters per token. Despite this lighter activation, the standard MoE approach increases the total number of parameters. We show that this trade-off between total size and sparsity can be avoided without sacrificing performance relative to a dense baseline. We introduce MoEP (Modular Expert Paths), which adds sparsity while keeping the total parameter count fixed. MoEP combines model parallelism with MoE-style linear projections to implement selective token activation, which accelerates learning and enables the model to outperform the GPT-2 baseline. This opens a promising research direction in which compact models can still benefit from sparsity.
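To make the fixed-parameter-budget idea concrete, here is a minimal sketch of how a dense feed-forward projection could be partitioned into disjoint expert paths with top-1 token routing. This is an illustrative reconstruction from the abstract's description, not the authors' implementation; all names (`moep_ffn`, the router `R`, the expert slices `E1`/`E2`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n_experts = 8, 16, 4
h_e = h // n_experts  # per-expert hidden width

# Dense baseline FFN: d*h + h*d parameters in total.
W1 = rng.standard_normal((d, h))
W2 = rng.standard_normal((h, d))

# "Expert paths": slice the SAME parameter budget into n disjoint
# experts, so total parameters match the dense baseline exactly.
E1 = W1.reshape(d, n_experts, h_e).transpose(1, 0, 2)  # (n, d, h_e)
E2 = W2.reshape(n_experts, h_e, d)                     # (n, h_e, d)

# Router: one scoring direction per expert; top-1 choice per token.
R = rng.standard_normal((d, n_experts))

def moep_ffn(x):
    """x: (tokens, d). Each token activates exactly one expert path,
    so per-token compute drops by roughly a factor of n_experts."""
    choice = (x @ R).argmax(axis=-1)          # (tokens,)
    y = np.empty_like(x)
    for e in range(n_experts):
        mask = choice == e
        hidden = np.maximum(x[mask] @ E1[e], 0)  # ReLU
        y[mask] = hidden @ E2[e]
    return y

x = rng.standard_normal((5, d))
out = moep_ffn(x)
dense_params = W1.size + W2.size
moep_params = E1.size + E2.size  # equal to dense_params by construction
```

Note the key property claimed in the abstract: the sparse variant activates only a fraction of the weights per token, yet `moep_params == dense_params`, unlike standard MoE, which multiplies the parameter count by the number of experts.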
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24465