Training Mixture-of-Experts: A Focus on Expert-Token Matching

Published: 19 Mar 2024 · Last Modified: 23 Apr 2024 · Tiny Papers @ ICLR 2024 (Notable) · CC BY 4.0
Keywords: Vision Transformer, Mixture-of-Experts Models, Token Routing
TL;DR: We present an effective recipe for training VMoE (a sparse variant of the Vision Transformer), using the Sinkhorn algorithm to improve the token-expert matching process.
Abstract: Recent advancements in sparse Mixture-of-Experts (MoE) models, particularly in the Vision MoE (VMoE) framework, have demonstrated promising results in enhancing vision task performance. However, a key challenge persists: optimally routing tokens (such as image patches) to the right experts without incurring excessive computational cost. To address this, we apply regularized optimal transport, computed with the Sinkhorn algorithm, to the VMoE framework in order to improve the token-expert matching process. The resulting model, Sinkhorn-VMoE (SVMoE), represents a meaningful step toward optimizing the efficiency and effectiveness of sparsely-gated MoE models.
Submission Number: 123
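
As an illustrative sketch (not the authors' implementation), the following minimal Python example shows how Sinkhorn iterations can turn a token-expert affinity matrix into a balanced soft assignment, which is the style of regularized optimal-transport matching the abstract refers to. The function name `sinkhorn_route` and the parameters `epsilon` and `n_iters` are assumptions made for illustration only.

```python
import numpy as np

def sinkhorn_route(logits, n_iters=10, epsilon=0.05):
    """Hypothetical sketch of Sinkhorn-based token-expert matching.

    logits: (num_tokens, num_experts) router affinities.
    Returns a doubly-normalized assignment matrix whose rows (tokens)
    sum to a uniform token marginal and whose columns (experts)
    receive a balanced share of the total routing mass.
    """
    num_tokens, num_experts = logits.shape
    # Entropic regularization: soften the affinities before rescaling.
    K = np.exp(logits / epsilon)
    # Target marginals: every token is routed, load spread evenly over experts.
    row_target = np.full(num_tokens, 1.0 / num_tokens)
    col_target = np.full(num_experts, 1.0 / num_experts)
    u = np.ones(num_tokens)
    v = np.ones(num_experts)
    for _ in range(n_iters):
        # Alternately rescale rows and columns toward the target marginals.
        u = row_target / (K @ v)
        v = col_target / (K.T @ u)
    return np.diag(u) @ K @ np.diag(v)

# Example: route 8 tokens to 4 experts, then pick the top-1 expert per token.
rng = np.random.default_rng(0)
plan = sinkhorn_route(rng.normal(size=(8, 4)))
expert_choice = plan.argmax(axis=1)
```

In a setup like this, a top-k selection over the resulting transport plan (as in the last line) would stand in for the standard softmax-plus-top-k router, with the iteration count and regularization strength trading off matching quality against routing cost; the exact integration into VMoE is described in the paper itself.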