Keywords: hierarchical mixture of experts, expectation maximization, optimization
Abstract: We present a novel probabilistic expectation-maximization scheme for training hierarchical mixture-of-experts models that both exposes and exploits parallelism during training. By replacing the typical categorical distribution used in gating networks with a joint distribution blending cooperative and competitive mechanisms, we obtain a likelihood that encodes both global and local interactions between experts. Applying an M-splitting scheme then yields an M-step that decomposes into localized, embarrassingly parallel subproblems governing the local experts, with deferred corrections accounting for the global coupling between experts. Combined with a hierarchical decomposition of nested networks, this produces a fast multi-level training scheme reminiscent of multigrid algorithms, which avoids under-utilization of experts, exposes further GPU parallelism, and outperforms standard models on regression tasks. We provide experiments using a scalable GPU implementation that demonstrate rapid convergence and parallel scalability of the iterative scheme, as well as strong localization of the model for non-smooth, high-dimensional regression problems.
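To make the "embarrassingly parallel M-step" idea concrete, below is a minimal, illustrative sketch of EM for a standard flat mixture of linear experts with softmax gating. It is not the submission's method (no joint cooperative/competitive gating distribution, no deferred corrections, no hierarchical/multigrid structure); all names and the gating update are generic assumptions. It only shows how, in this simpler setting, the M-step splits into independent per-expert weighted least-squares solves that could be dispatched in parallel.

```python
# Illustrative sketch only: EM for a flat mixture of K linear experts.
# The K per-expert M-step solves are mutually independent (the kind of
# "embarrassingly parallel" local subproblem the abstract alludes to).
import numpy as np

def logsumexp(a):
    m = a.max(axis=1, keepdims=True)
    return m + np.log(np.exp(a - m).sum(axis=1, keepdims=True))

def em_mixture_of_linear_experts(X, y, n_experts=4, n_iter=50, reg=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])                 # add bias column
    W = rng.normal(scale=0.1, size=(n_experts, d + 1))   # expert weights
    G = rng.normal(scale=0.1, size=(n_experts, d + 1))   # gating weights
    sigma2 = np.full(n_experts, np.var(y) + 1e-8)        # expert noise variances

    for _ in range(n_iter):
        # E-step: responsibilities = softmax gating prior x Gaussian expert likelihood.
        logits = Xb @ G.T                                 # (n, K)
        log_prior = logits - logsumexp(logits)
        resid = y[:, None] - Xb @ W.T                     # (n, K)
        log_lik = -0.5 * (np.log(2 * np.pi * sigma2) + resid**2 / sigma2)
        log_post = log_prior + log_lik
        R = np.exp(log_post - logsumexp(log_post))        # (n, K) responsibilities

        # M-step: each expert solves its own weighted least-squares problem.
        # These K solves do not depend on each other and could run in parallel.
        for k in range(n_experts):
            w = R[:, k]
            A = Xb.T @ (w[:, None] * Xb) + reg * np.eye(d + 1)
            b = Xb.T @ (w * y)
            W[k] = np.linalg.solve(A, b)
            rk = y - Xb @ W[k]
            sigma2[k] = (w @ rk**2) / (w.sum() + 1e-12)

        # Simple gradient ascent step on the gating parameters toward the
        # responsibilities (a common, generic choice; not the paper's gating update).
        G += 0.5 * ((R - np.exp(log_prior)).T @ Xb) / n

    return W, G, sigma2
```

Because each expert's solve touches only its own responsibility-weighted normal equations, the loop over `k` is a natural target for batched GPU execution; the submission's scheme additionally handles the global coupling between experts that this flat sketch ignores.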
Supplementary Material: zip
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Submission Number: 20540