Keywords: hierarchical mixture of experts, expectation maximization, optimization
Abstract: We present a novel probabilistic expectation-maximization scheme for training hierarchical mixture-of-experts models that both exposes and exploits parallelism during training. By replacing the typical categorical distribution used in gating networks with a joint distribution blending cooperative and competitive mechanisms, we obtain a likelihood that encodes both global and local interactions between experts. Applying an M-splitting scheme then yields an M-step that decomposes into localized, embarrassingly parallel subproblems governing the local experts, with deferred corrections accounting for the global coupling between experts. Combined with a hierarchical decomposition of nested networks, this produces a fast multi-level training scheme reminiscent of multigrid algorithms, which avoids under-utilization of experts, exposes further GPU parallelism, and outperforms standard models on regression tasks. We provide experiments using a scalable GPU implementation that demonstrate rapid convergence and parallel scalability of the iterative scheme, as well as strong localization of the model for non-smooth, high-dimensional regression problems.
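To make the "embarrassingly parallel M-step" idea concrete, below is a minimal, illustrative sketch of EM for a standard flat mixture of linear experts with softmax gating. It is not the submission's method (no joint cooperative/competitive gating distribution, no deferred corrections, no hierarchical/multigrid structure); all names and the gating update are generic assumptions. It only shows how, in this simpler setting, the M-step splits into independent per-expert weighted least-squares solves that could be dispatched in parallel.

```python
# Illustrative sketch only: EM for a flat mixture of K linear experts.
# The K per-expert M-step solves are mutually independent (the kind of
# "embarrassingly parallel" local subproblem the abstract alludes to).
import numpy as np

def logsumexp(a):
    m = a.max(axis=1, keepdims=True)
    return m + np.log(np.exp(a - m).sum(axis=1, keepdims=True))

def em_mixture_of_linear_experts(X, y, n_experts=4, n_iter=50, reg=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])                 # add bias column
    W = rng.normal(scale=0.1, size=(n_experts, d + 1))   # expert weights
    G = rng.normal(scale=0.1, size=(n_experts, d + 1))   # gating weights
    sigma2 = np.full(n_experts, np.var(y) + 1e-8)        # expert noise variances

    for _ in range(n_iter):
        # E-step: responsibilities = softmax gating prior x Gaussian expert likelihood.
        logits = Xb @ G.T                                 # (n, K)
        log_prior = logits - logsumexp(logits)
        resid = y[:, None] - Xb @ W.T                     # (n, K)
        log_lik = -0.5 * (np.log(2 * np.pi * sigma2) + resid**2 / sigma2)
        log_post = log_prior + log_lik
        R = np.exp(log_post - logsumexp(log_post))        # (n, K) responsibilities

        # M-step: each expert solves its own weighted least-squares problem.
        # These K solves do not depend on each other and could run in parallel.
        for k in range(n_experts):
            w = R[:, k]
            A = Xb.T @ (w[:, None] * Xb) + reg * np.eye(d + 1)
            b = Xb.T @ (w * y)
            W[k] = np.linalg.solve(A, b)
            rk = y - Xb @ W[k]
            sigma2[k] = (w @ rk**2) / (w.sum() + 1e-12)

        # Simple gradient ascent step on the gating parameters toward the
        # responsibilities (a common, generic choice; not the paper's gating update).
        G += 0.5 * ((R - np.exp(log_prior)).T @ Xb) / n

    return W, G, sigma2
```

Because each expert's solve touches only its own responsibility-weighted normal equations, the loop over `k` is a natural target for batched GPU execution; the submission's scheme additionally handles the global coupling between experts that this flat sketch ignores.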
Supplementary Material: zip
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Submission Number: 20540