DenseMixer: Improving MoE Post-Training with Precise Router Gradient

ICLR 2026 Conference Submission 22322 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: MoE, Post-Training
TL;DR: We propose DenseMixer, a plug-and-play MoE post-training method that improves router gradients and training quality.
Abstract: Mixture-of-Experts (MoE) models are notoriously harder to train than dense models. Existing approaches either rely on imprecise router gradients or freeze router parameters entirely, limiting training effectiveness. We introduce DenseMixer, a novel MoE post-training technique that trades one extra forward pass on inactive experts for a more precise estimate of the router gradient. Our method consistently outperforms conventional methods across different MoE scales (7B, 14B, 16B, 30B), architectures (with/without shared experts), pre-training methods (from scratch/up-cycling), and post-training data types (instruction/long CoT data). It is universally applicable to any MoE using Top-K routing and can be used in a plug-and-play manner, compatible with existing training libraries and parameter-efficient methods like LoRA, and it introduces no changes to inference. We provide comprehensive empirical validation showing DenseMixer's effectiveness in improving MoE post-training quality while keeping the computational overhead practical.
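The abstract only outlines the mechanism, so below is a minimal PyTorch-style sketch of the general idea it describes: run the inactive experts once without gradients and fold their outputs in through a term that is zero in value, so the forward output matches ordinary Top-K routing while the router still receives gradient signal from all experts. The class name `TopKMoESketch`, the expert architecture, and the exact combination rule are illustrative assumptions, not the authors' implementation (which may renormalize Top-K weights, add load-balancing losses, and batch expert computation differently).

```python
# Hypothetical sketch of dense-like router gradients for a Top-K MoE layer.
# Forward value equals the usual sparse Top-K output; inactive experts are run
# once under no_grad, so their parameters get no gradient and only the router does.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoESketch(nn.Module):
    def __init__(self, d_model: int, n_experts: int, top_k: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)            # (tokens, n_experts)
        _, topk_idx = probs.topk(self.top_k, dim=-1)
        active = torch.zeros_like(probs).scatter(-1, topk_idx, 1.0).bool()

        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = active[:, i]
            if sel.any():
                # Active experts: normal forward/backward, weighted by the gate prob.
                out[sel] = out[sel] + probs[sel, i:i + 1] * expert(x[sel])
            if (~sel).any():
                # Inactive experts: one extra forward pass, detached from autograd.
                with torch.no_grad():
                    frozen = expert(x[~sel])
                # (p - p.detach()) is zero in value, so the sparse forward output is
                # unchanged, but the router still gets a gradient through p here.
                p = probs[~sel, i:i + 1]
                out[~sel] = out[~sel] + (p - p.detach()) * frozen
        return out
```

Used as a drop-in replacement for a sparse MoE layer during post-training only, this keeps inference identical to standard Top-K routing, since the extra term contributes nothing to the forward value.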
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22322