Keywords: mixture-of-experts, sample efficiency
TL;DR: We introduce MoEP (Modular Expert Paths), which adds sparsity while keeping the total parameter count fixed. MoEP combines model parallelism with MoE-style linear projections to implement selective token activation.
Abstract: The transition from dense to sparse model architectures has become a key trend in Large Language Models (LLMs). Methods like Mixture-of-Experts (MoE) let language models scale their representational power without a proportional increase in computation, by activating only a sparse subset of parameters per token. Despite this lighter activation, the standard MoE approach increases the total number of parameters. We show that this trade-off between total size and sparsity can be avoided without sacrificing performance relative to a dense baseline. We introduce MoEP (Modular Expert Paths), which adds sparsity while keeping the total parameter count fixed. MoEP combines model parallelism with MoE-style linear projections to implement selective token activation, which accelerates learning and enables the model to outperform the GPT-2 baseline. This opens a promising research direction in which compact models can still benefit from sparsity.
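To make the fixed-parameter-budget idea concrete, here is a minimal sketch of how a dense feed-forward projection could be partitioned into disjoint expert paths with top-1 token routing. This is an illustrative reconstruction from the abstract's description, not the authors' implementation; all names (`moep_ffn`, the router `R`, the expert slices `E1`/`E2`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n_experts = 8, 16, 4
h_e = h // n_experts  # per-expert hidden width

# Dense baseline FFN: d*h + h*d parameters in total.
W1 = rng.standard_normal((d, h))
W2 = rng.standard_normal((h, d))

# "Expert paths": slice the SAME parameter budget into n disjoint
# experts, so total parameters match the dense baseline exactly.
E1 = W1.reshape(d, n_experts, h_e).transpose(1, 0, 2)  # (n, d, h_e)
E2 = W2.reshape(n_experts, h_e, d)                     # (n, h_e, d)

# Router: one scoring direction per expert; top-1 choice per token.
R = rng.standard_normal((d, n_experts))

def moep_ffn(x):
    """x: (tokens, d). Each token activates exactly one expert path,
    so per-token compute drops by roughly a factor of n_experts."""
    choice = (x @ R).argmax(axis=-1)          # (tokens,)
    y = np.empty_like(x)
    for e in range(n_experts):
        mask = choice == e
        hidden = np.maximum(x[mask] @ E1[e], 0)  # ReLU
        y[mask] = hidden @ E2[e]
    return y

x = rng.standard_normal((5, d))
out = moep_ffn(x)
dense_params = W1.size + W2.size
moep_params = E1.size + E2.size  # equal to dense_params by construction
```

Note the key property claimed in the abstract: the sparse variant activates only a fraction of the weights per token, yet `moep_params == dense_params`, unlike standard MoE, which multiplies the parameter count by the number of experts.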
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24465