Abstract: Large language models have become indispensable for many text-processing applications. Their inference, i.e. their use to generate text, is time-consuming because tokens must be generated one after the other, even when the computational load has been reduced by model sparsification, e.g. by using Mixture of Experts (MoE) models. In the MoE context, a subset of experts is selected at each stage. Not all subsets of experts (pairs of experts in most cases) in a given layer have the same probability of being selected. When experts are mapped to different GPUs, there is a risk of load imbalance if the selected experts end up on a small number of GPUs. This paper proposes to leverage this heterogeneity in expert usage by mapping the experts of popular subsets onto distinct GPUs, allowing them to be processed in parallel and thus reducing inference time. Even though this mapping problem is NP-complete, it is possible to design simple greedy strategies that significantly reduce the need for sequential expert processing. Our proof-of-concept confirms that our mapping strategies effectively reduce inference time on the Mixtral model.
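To make the idea concrete, the sketch below illustrates one possible greedy heuristic of the kind the abstract alludes to: frequently co-selected expert pairs are placed on distinct GPUs so they can run in parallel. The function name greedy_expert_mapping, the pair_counts statistics, and the per-GPU capacity rule are illustrative assumptions for this sketch, not the exact strategies evaluated in the paper.

```python
def greedy_expert_mapping(pair_counts, num_experts, num_gpus):
    """Greedily assign experts to GPUs so that the most frequently
    co-selected expert pairs end up on distinct GPUs.

    pair_counts: dict mapping (expert_i, expert_j) tuples to how often
                 that pair was selected together (hypothetical profiling data).
    Returns a dict expert -> gpu.
    """
    capacity = -(-num_experts // num_gpus)      # experts per GPU (ceiling)
    load = [0] * num_gpus                       # experts placed on each GPU
    mapping = {}

    # Process pairs from most to least frequently co-selected.
    for (a, b), _count in sorted(pair_counts.items(), key=lambda kv: -kv[1]):
        for e in (a, b):
            if e in mapping:
                continue
            # Prefer a GPU that does not already host the partner expert.
            forbidden = {mapping[p] for p in (a, b) if p in mapping and p != e}
            candidates = [g for g in range(num_gpus)
                          if load[g] < capacity and g not in forbidden]
            if not candidates:  # fall back to any GPU with remaining room
                candidates = [g for g in range(num_gpus) if load[g] < capacity]
            gpu = min(candidates, key=lambda g: load[g])
            mapping[e] = gpu
            load[gpu] += 1

    # Place experts that never appeared in a counted pair.
    for e in range(num_experts):
        if e not in mapping:
            gpu = min(range(num_gpus), key=lambda g: load[g])
            mapping[e] = gpu
            load[gpu] += 1
    return mapping


# Example with made-up counts: 8 experts, 4 GPUs; the heavily used pair (0, 1)
# should land on two different GPUs.
counts = {(0, 1): 120, (2, 3): 45, (0, 4): 30}
print(greedy_expert_mapping(counts, num_experts=8, num_gpus=4))
```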