Unbiased Gradient Estimation with Balanced Assignments for Mixtures of Experts

Published: 09 Dec 2021, Last Modified: 05 May 2023
ICBINB@NeurIPS 2021 Contributed Talk
Keywords: mixture, experts, gumbel, matching, balanced, assignment, skipping, gradient, estimation, unbiased, mixture of experts, gumbel-matching, gumbel-max
TL;DR: Two unbiased estimators for training mixtures of experts with per-expert capacity constraints, based on skipping datapoints or on balanced assignments using Gumbel-Matching.
Abstract: Training large-scale mixture-of-experts models efficiently on modern hardware requires assigning the datapoints in a batch to different experts, each with limited capacity. Recently proposed assignment procedures lack a probabilistic interpretation and use biased estimators for training. As an alternative, we propose two unbiased estimators based on principled stochastic assignment procedures: one that skips datapoints which exceed expert capacity, and one that samples perfectly balanced assignments using an extension of the Gumbel-Matching distribution [29]. Both estimators are unbiased, as they correct for the sampling procedure used. In a toy experiment, we find the 'skip'-estimator is more effective than the balanced sampling one, and both are more robust in solving the task than biased alternatives.
Category: Negative result: I would like to share my insights and negative results on this topic with the community
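
The abstract describes two stochastic assignment procedures. As a rough illustration of the first (a minimal sketch, not the paper's implementation), the 'skip' procedure can be written in plain NumPy: sample each datapoint's expert with the Gumbel-max trick and drop datapoints once their chosen expert is full. The function name, the NumPy setting, and the random processing order are assumptions for illustration; the gradient correction that makes the resulting estimator unbiased (e.g. reweighting by each datapoint's probability of being kept) is omitted here.

    import numpy as np

    def sample_assignments_with_skipping(logits, capacity, rng):
        """Sample expert assignments via the Gumbel-max trick, skipping
        datapoints once their chosen expert is full.

        logits:   (batch, num_experts) unnormalized gating scores
        capacity: maximum number of datapoints per expert
        Returns expert indices per datapoint, with -1 marking skipped points.
        """
        batch, num_experts = logits.shape
        # Gumbel-max trick: argmax over Gumbel-perturbed logits is an exact
        # sample from the softmax distribution over experts.
        choices = np.argmax(logits + rng.gumbel(size=logits.shape), axis=-1)
        load = np.zeros(num_experts, dtype=int)
        assignment = np.full(batch, -1, dtype=int)
        for i in rng.permutation(batch):  # random order, so no position bias
            if load[choices[i]] < capacity:
                assignment[i] = choices[i]
                load[choices[i]] += 1
        return assignment

The balanced procedure builds on the Gumbel-Matching distribution, a perturb-and-MAP construction: add Gumbel noise to the datapoint-expert scores and take the maximum-weight matching. Below is a sketch under the assumptions (mine, not necessarily the paper's) that the batch size equals num_experts × capacity and that each expert slot receives independent noise, using scipy.optimize.linear_sum_assignment as the matching solver.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def sample_balanced_assignment(logits, capacity, rng):
        """Perturb-and-MAP sketch of a balanced assignment: add Gumbel noise
        to the datapoint-expert scores and solve a maximum-weight matching.

        Assumes batch == num_experts * capacity, so every expert is filled
        exactly to capacity and no datapoint is skipped.
        """
        batch, num_experts = logits.shape
        assert batch == num_experts * capacity
        # Replicate each expert into `capacity` slots so the matching is 1:1.
        slot_logits = np.repeat(logits, capacity, axis=1)
        perturbed = slot_logits + rng.gumbel(size=slot_logits.shape)
        rows, cols = linear_sum_assignment(perturbed, maximize=True)
        assignment = np.empty(batch, dtype=int)
        assignment[rows] = cols // capacity  # map slots back to experts
        return assignment

Both sketches can be driven with rng = np.random.default_rng(0); the first may return -1 entries (skipped datapoints), while the second always returns a perfectly balanced assignment.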