Keywords: Continual Pre-training, MoEs, Mixture of Experts, LLM, Pre-training, Rewarm, Deepseek MoE, Switch MoE, Sinkhorn Routing
TL;DR: In the context of large-scale LLM pre-training, we study the continual pre-training performance of multiple popular MoE architectures relative to a FLOP-matched dense baseline.
Abstract: Prior work has shown that a simple combination of replay and learning rate re-warming and re-decaying can enable the continual pre-training (CPT) of dense decoder-only transformers with minimal performance degradation compared to full re-training. In the case of decoder-only MoE transformers, however, it is unclear how the routing algorithm impacts continual pre-training performance: 1) do the MoE transformer's routers exacerbate forgetting relative to a dense model? 2) do the routers maintain a balanced load on previous distributions after CPT? 3) are the same strategies applied to dense models sufficient to continually pre-train MoE LLMs? In what follows, we conduct a large-scale ($>2$B-parameter Switch and DeepSeek MoE LLMs trained for $600$B tokens) empirical study across four MoE transformers to answer these questions. Our results establish a surprising robustness to distribution shifts for MoEs using both Sinkhorn-balanced and Z-and-Aux-loss-balanced routing algorithms, even in MoEs continually pre-trained without replay. Moreover, we show that MoE LLMs maintain their sample efficiency (relative to a FLOP-matched dense model) during CPT, and that they can match the performance of a fully re-trained MoE at a fraction of the cost.
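For context on the routing algorithms discussed above, below is a minimal PyTorch sketch of Switch-style top-1 routing with an auxiliary load-balancing loss (one of the balancing mechanisms the abstract refers to); it is not the authors' implementation, and the function name, tensor shapes, and `aux_loss_coef` value are illustrative assumptions.

```python
# Minimal sketch of Switch-style top-1 routing with an auxiliary
# load-balancing loss; illustrative only, not the paper's code.
import torch
import torch.nn.functional as F

def switch_route(x, router_weight, num_experts, aux_loss_coef=1e-2):
    """x: [tokens, d_model]; router_weight: [d_model, num_experts]."""
    logits = x @ router_weight                 # [tokens, num_experts]
    probs = F.softmax(logits, dim=-1)          # routing probabilities
    expert_idx = probs.argmax(dim=-1)          # top-1 expert per token

    # f: fraction of tokens dispatched to each expert;
    # P: mean router probability assigned to each expert.
    # aux_loss = N * sum_i f_i * P_i encourages a balanced expert load.
    f = F.one_hot(expert_idx, num_experts).float().mean(dim=0)
    P = probs.mean(dim=0)
    aux_loss = aux_loss_coef * num_experts * torch.sum(f * P)
    return expert_idx, probs, aux_loss

# Example usage on random activations.
tokens, d_model, num_experts = 8, 16, 4
x = torch.randn(tokens, d_model)
w = torch.randn(d_model, num_experts)
idx, probs, aux = switch_route(x, w, num_experts)
```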
Serve As Reviewer: ~Benjamin_Thérien1
Submission Number: 1