Keywords: Continual Pre-training, MoEs, Mixture of Experts, LLM, Pre-training, Rewarm, Deepseek MoE, Switch MoE, Sinkhorn Routing
TL;DR: In the context of large-scale LLM pre-training, we study the continual pre-training performance of multiple popular MoE architectures relative to a FLOP-matched dense baseline.
Abstract: Prior work has shown that a simple combination of replay and learning rate re-warming and re-decaying can enable the continual pre-training (CPT) of dense decoder-only transformers with minimal performance degradation compared to full re-training. In the case of decoder-only MoE transformers, however, it is unclear how the routing algorithm impacts continual pre-training performance: 1) do the MoE transformer's routers exacerbate forgetting relative to a dense model? 2) do the routers maintain a balanced load on previous distributions after CPT? 3) are the same strategies applied to dense models sufficient to continually pre-train MoE LLMs? In what follows, we conduct a large-scale ($>2$B-parameter Switch and DeepSeek MoE LLMs trained for $600$B tokens) empirical study across four MoE transformers to answer these questions. Our results establish a surprising robustness to distribution shifts for MoEs using both Sinkhorn-balanced and Z-and-Aux-loss-balanced routing algorithms, even in MoEs continually pre-trained without replay. Moreover, we show that MoE LLMs maintain their sample efficiency (relative to a FLOP-matched dense model) during CPT, and that they can match the performance of a fully re-trained MoE at a fraction of the cost.
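For context on the routing algorithms discussed above, below is a minimal PyTorch sketch of Switch-style top-1 routing with an auxiliary load-balancing loss (one of the balancing mechanisms the abstract refers to); it is not the authors' implementation, and the function name, tensor shapes, and `aux_loss_coef` value are illustrative assumptions.

```python
# Minimal sketch of Switch-style top-1 routing with an auxiliary
# load-balancing loss; illustrative only, not the paper's code.
import torch
import torch.nn.functional as F

def switch_route(x, router_weight, num_experts, aux_loss_coef=1e-2):
    """x: [tokens, d_model]; router_weight: [d_model, num_experts]."""
    logits = x @ router_weight                 # [tokens, num_experts]
    probs = F.softmax(logits, dim=-1)          # routing probabilities
    expert_idx = probs.argmax(dim=-1)          # top-1 expert per token

    # f: fraction of tokens dispatched to each expert;
    # P: mean router probability assigned to each expert.
    # aux_loss = N * sum_i f_i * P_i encourages a balanced expert load.
    f = F.one_hot(expert_idx, num_experts).float().mean(dim=0)
    P = probs.mean(dim=0)
    aux_loss = aux_loss_coef * num_experts * torch.sum(f * P)
    return expert_idx, probs, aux_loss

# Example usage on random activations.
tokens, d_model, num_experts = 8, 16, 4
x = torch.randn(tokens, d_model)
w = torch.randn(d_model, num_experts)
idx, probs, aux = switch_route(x, w, num_experts)
```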
Serve As Reviewer: ~Benjamin_Thérien1
Submission Number: 1