Keywords: Mixture-of-Experts, MoE, routing, transformer, LLM
Abstract: Sparsely-gated Mixture-of-Experts (MoE) models have proven more efficient than dense Transformers because they dynamically activate only a subset of their parameters by routing each token to a selected set of experts, allowing practitioners to scale up parameter counts without significantly increasing total compute. However, current MoE training approaches update the router with only a sparse gradient and suffer from issues such as load imbalance. We propose a new router that receives a dense gradient update from a sparse forward pass. Our method adds minimal overhead, yet improves on standard Top-K routing in both performance and load balance.
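To illustrate the baseline the abstract refers to, below is a minimal sketch (not the authors' proposed method, which the abstract does not detail) of standard Top-K MoE routing in PyTorch. The class name `TopKRouterMoE` and all sizes are illustrative assumptions; the point is that the router's weights receive gradient signal only through the K selected experts per token, i.e., a sparse gradient.

```python
# Minimal sketch of standard Top-K MoE routing (assumed baseline, not the paper's router).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouterMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.router(x)                  # (tokens, n_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        gates = F.softmax(topk_vals, dim=-1)     # renormalize over the K selected logits
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]
            gate = gates[:, slot].unsqueeze(-1)
            for e in range(len(self.experts)):
                mask = idx == e
                if mask.any():
                    # Only the K selected routes contribute to the output, so the
                    # router weights get gradient only through the chosen experts.
                    out[mask] += gate[mask] * self.experts[e](x[mask])
        return out

tokens = torch.randn(16, 64)
moe = TopKRouterMoE()
moe(tokens).sum().backward()
print(moe.router.weight.grad.shape)  # gradient exists, but reflects only the top-K picks
```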
Submission Number: 74