MoRE: Mixture of Reused Experts
Keywords: Mixture-of-Experts, Weight Sharing, Weight Symmetry, Large Language Models, Transformer Architectures
TL;DR: We propose MoRE, which shares expert weights across adjacent layers and conditions the shared experts on depth. This expands the routing space without increasing expert parameter memory, outperforming parameter- and compute-matched standard MoEs and recurrent baselines.
Abstract: Standard Mixture-of-Experts (MoE) architectures maintain a separate pool of experts at every layer, with no weight-space tying between experts at different depths. We propose Mixture of Reused Experts (MoRE), which introduces such a tying symmetry by sharing expert feedforward parameters across adjacent layers, and conditioning the input to those shared experts with a small learnable depth embedding. Sharing alone enables MoRE to reuse a larger routing pool at constant expert parameter cost; the depth embedding enables additional gains that come from letting the same shared experts specialize by layer. Experiments across three model scales (114M–1.15B parameters) show that MoRE consistently achieves lower perplexity and stronger downstream performance than standard MoEs and state-of-the-art weight-sharing architectures at matched compute and parameter budgets, with only minimal modifications to existing MoE implementations.
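The tying described in the abstract can be illustrated with a minimal sketch in pure Python. It is not the authors' implementation: the class name `SharedExpertGroup`, the per-layer routers, the additive form of the depth conditioning, the ReLU feedforward, and the top-k softmax gating are all assumptions made for illustration, since the abstract only states that expert feedforward parameters are shared across adjacent layers and that their input is conditioned with a small learnable depth embedding.

```python
import math
import random


def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class SharedExpertGroup:
    """Hypothetical sketch: one expert pool reused by n_layers adjacent layers."""

    def __init__(self, d_model, d_hidden, n_experts, n_layers, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Expert FFN weights are stored once and shared by every layer in the group.
        self.w_in = [rand_matrix(d_hidden, d_model, rng) for _ in range(n_experts)]
        self.w_out = [rand_matrix(d_model, d_hidden, rng) for _ in range(n_experts)]
        # Assumption: each layer keeps its own router and its own depth embedding.
        self.routers = [rand_matrix(n_experts, d_model, rng) for _ in range(n_layers)]
        self.depth_emb = [
            [rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(n_layers)
        ]

    def expert_param_count(self):
        # Counts only expert FFN weights; independent of the number of layers.
        return sum(len(W) * len(W[0]) for W in self.w_in + self.w_out)

    def forward(self, x, layer):
        # This layer's router scores every expert in the shared pool.
        logits = matvec(self.routers[layer], x)
        order = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)
        top = order[: self.top_k]
        # Softmax over the selected experts only.
        m = max(logits[e] for e in top)
        gates = [math.exp(logits[e] - m) for e in top]
        z = sum(gates)
        # Assumption: depth conditioning is additive on the expert input.
        h = [xi + di for xi, di in zip(x, self.depth_emb[layer])]
        out = [0.0] * len(x)
        for e, g in zip(top, gates):
            hidden = [max(0.0, v) for v in matvec(self.w_in[e], h)]  # ReLU
            y = matvec(self.w_out[e], hidden)
            out = [o + (g / z) * yi for o, yi in zip(out, y)]
        return out


# The same token passed through two layers that share one expert pool.
group = SharedExpertGroup(d_model=4, d_hidden=8, n_experts=4, n_layers=2)
x = [1.0, -0.5, 0.25, 2.0]
y0 = group.forward(x, layer=0)
y1 = group.forward(x, layer=1)
```

In this sketch, adding layers to the group adds only one router and one depth embedding per layer, while `expert_param_count()` stays fixed. That is the sense in which sharing lets each layer route over a larger pool at constant expert parameter cost.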
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 41