Keywords: Mixture-of-Experts, Clustering, Robustness
TL;DR: We propose a novel Mixture-of-Experts routing method that computes token-expert assignments in a transformed space that promotes separation of the latent clusters in the data, making it easier to identify the best-matched expert for each token.
Abstract: At the core of Sparse Mixture-of-Experts (MoE) models is the router, which learns the clustering structure of the input distribution in order to direct tokens to suitable experts. However, these latent clusters may be unidentifiable, causing slow convergence, vulnerability to contamination, and degraded representations. We examine the router through the lens of clustering optimization, deriving optimal feature weights that maximally distinguish these clusters. Using these weights, we compute token-expert assignments in an adaptively transformed space that better separates clusters, helping identify the best-matched expert for each token. In particular, for each expert cluster, we compute weights that scale each feature according to how tightly that expert's tokens cluster along it. We term this novel router the Adaptive Clustering (AC) router. Our AC router confers three connected benefits: 1) faster convergence, 2) better robustness, and 3) overall performance improvement, as experts specialize in semantically distinct regions of the input space. We empirically demonstrate the advantages of our AC router in language modeling and image classification in both clean and corrupted settings.
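To make the routing idea concrete, below is a minimal, hypothetical sketch of assignment in a reweighted space: each expert receives per-feature weights that are larger along features where its assigned tokens cluster tightly, and tokens are scored by weighted distance to each expert's centroid. The function name, the inverse-variance weighting, and the normalization are illustrative assumptions, not the paper's actual AC router formulation.

```python
import torch

def ac_routing_scores(tokens, expert_centroids, assignments, eps=1e-6):
    """Hypothetical sketch of adaptive-clustering-style routing.

    tokens:           (n_tokens, d) token representations
    expert_centroids: (n_experts, d) per-expert cluster centers
    assignments:      (n_tokens,) current expert index for each token
    Returns (n_tokens, n_experts) routing scores in the reweighted space.
    """
    n_experts, d = expert_centroids.shape
    scores = torch.empty(tokens.size(0), n_experts)
    for e in range(n_experts):
        members = tokens[assignments == e]
        # Per-feature spread of this expert's cluster; tight features get larger weights.
        if members.size(0) > 1:
            spread = members.var(dim=0, unbiased=False) + eps
        else:
            spread = torch.ones(d)
        w = 1.0 / spread
        w = w / w.sum() * d  # normalize so weights average to 1
        diff = tokens - expert_centroids[e]
        # Negative weighted squared distance: closer in the reweighted space -> higher score.
        scores[:, e] = -(w * diff.pow(2)).sum(dim=1)
    return scores

# Usage: route each token to the highest-scoring expert (top-1 routing).
# top1_expert = ac_routing_scores(tokens, centroids, assignments).argmax(dim=1)
```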
Submission Number: 37