Keywords: Mixture-of-Experts, MoE, routing, transformer, LLM
Abstract: Sparsely-gated Mixture-of-Experts (MoE) models have proven more efficient than dense Transformers because they dynamically activate only a subset of their parameters by routing each token to a selected set of experts, allowing practitioners to scale up parameter counts without significantly increasing total compute. However, current MoE training approaches update the router with only a sparse gradient and suffer from issues such as load imbalance. We propose a new router that receives a dense gradient update from a sparse forward pass. Our method adds minimal overhead, yet improves on standard Top-K routing in both performance and load balance.
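To illustrate the baseline the abstract refers to, below is a minimal sketch (not the authors' proposed method, which the abstract does not detail) of standard Top-K MoE routing in PyTorch. The class name `TopKRouterMoE` and all sizes are illustrative assumptions; the point is that the router's weights receive gradient signal only through the K selected experts per token, i.e., a sparse gradient.

```python
# Minimal sketch of standard Top-K MoE routing (assumed baseline, not the paper's router).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouterMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.router(x)                  # (tokens, n_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        gates = F.softmax(topk_vals, dim=-1)     # renormalize over the K selected logits
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]
            gate = gates[:, slot].unsqueeze(-1)
            for e in range(len(self.experts)):
                mask = idx == e
                if mask.any():
                    # Only the K selected routes contribute to the output, so the
                    # router weights get gradient only through the chosen experts.
                    out[mask] += gate[mask] * self.experts[e](x[mask])
        return out

tokens = torch.randn(16, 64)
moe = TopKRouterMoE()
moe(tokens).sum().backward()
print(moe.router.weight.grad.shape)  # gradient exists, but reflects only the top-K picks
```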
Submission Number: 74