Abstract: Mixture-of-Experts (MoE) activates only a subset of experts during inference, allowing the model to maintain low inference FLOPs and latency even as the parameter count scales up. However, since MoE dynamically selects the experts, all experts must be loaded into VRAM. Their large parameter size still limits deployment, and offloading, which loads experts into VRAM only when needed, significantly increases inference latency. To address this, we propose Mixture of Lookup Experts (MoLE), a new MoE architecture that is efficient in both communication and VRAM usage. In MoLE, the experts are Feed-Forward Networks (FFNs) during training, taking the output of the embedding layer as input. Before inference, these experts can be re-parameterized as lookup tables (LUTs) that retrieve expert outputs based on input IDs, and offloaded to storage devices. Therefore, we do not need to perform expert computations during inference. Instead, we directly retrieve the experts' precomputed outputs based on input IDs and load them into VRAM, so the resulting communication overhead is negligible. Experiments show that, with the same FLOPs and VRAM usage, MoLE achieves inference speeds comparable to dense models and significantly faster than MoE with expert offloading, while maintaining performance on par with MoE. Code: https://github.com/JieShibo/MoLE.
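To make the re-parameterization described above concrete, here is a minimal PyTorch sketch of how experts that take only the embedding layer's output as input can be precomputed into lookup tables before inference. All module names, sizes, and the chunking helper are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the re-parameterization step described in the abstract:
# during training each expert is an FFN whose input is the embedding layer's
# output, so its result depends only on the token id. Before inference, each
# expert's output can therefore be precomputed once per vocabulary entry and
# stored as a lookup table (LUT). All names and sizes here are illustrative.

vocab_size, d_model, d_ffn, num_experts = 32000, 1024, 4096, 8

embedding = nn.Embedding(vocab_size, d_model)
experts = nn.ModuleList([
    nn.Sequential(nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model))
    for _ in range(num_experts)
])

@torch.no_grad()
def reparameterize_to_luts(embedding, experts, chunk=4096):
    """Precompute expert(embedding(id)) for every token id, one LUT per expert."""
    ids = torch.arange(embedding.num_embeddings)
    luts = []
    for expert in experts:
        rows = [expert(embedding(ids[i:i + chunk])) for i in range(0, len(ids), chunk)]
        luts.append(torch.cat(rows))        # (vocab_size, d_model)
    return torch.stack(luts)                # (num_experts, vocab_size, d_model)

luts = reparameterize_to_luts(embedding, experts)
torch.save(luts, "mole_luts.pt")            # offload the LUTs to a storage device
```

After this step, the expert FFNs are no longer needed at inference; only the LUT rows corresponding to the current input IDs ever need to be transferred into VRAM.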
Lay Summary: The Mixture of Experts (MoE) architecture is a mainstream design for today’s large language models (LLMs). When generating a word, an MoE model activates only a small subset of its numerous expert modules. However, a key challenge arises: although only a few experts are used at each step, all experts typically need to reside in GPU memory. This makes it difficult to deploy such models on devices with limited memory, such as smartphones or personal computers.
A common workaround is to store the experts on lower-tier storage, such as disk, and load the required experts into GPU memory on demand. While this approach reduces memory usage, it incurs significant data-transfer overhead, which drastically slows down inference.
To address this, we propose a new architecture that transforms the computation of experts into a lookup process. This design allows all computations to be completed without loading the experts into GPU memory, thereby reducing memory usage while avoiding the latency caused by large-scale parameter transfers.
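As a rough illustration of the lookup process described above, the sketch below fetches precomputed expert outputs by token ID from LUTs kept outside GPU memory and combines them with router weights. Memory-mapping the saved file, the file name, the tensor shapes, and the combination step are all assumptions for illustration, not the paper's exact procedure.

```python
import torch

# Illustrative sketch (not the released implementation) of the lookup step the
# summary describes: at inference, per-token expert outputs are fetched by token
# id from LUTs that stay on lower-tier storage, so no expert weights reside in
# GPU memory. Memory-mapping the saved file is one assumed offloading mechanism.

luts = torch.load("mole_luts.pt", mmap=True)    # (num_experts, vocab_size, d_model), kept off-GPU

def lookup_expert_outputs(token_ids, gate_weights):
    """token_ids: (seq_len,) int64; gate_weights: (seq_len, num_experts) router scores."""
    # Gather only the rows for the current tokens: (num_experts, seq_len, d_model).
    rows = luts[:, token_ids, :].float()
    # Weighted sum over experts; only this small result needs to reach the GPU.
    out = torch.einsum("se,esd->sd", gate_weights, rows)
    return out.to("cuda" if torch.cuda.is_available() else "cpu")
```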
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/JieShibo/MoLE
Primary Area: Deep Learning->Large Language Models
Keywords: Mixture-of-Experts, Large Language Models
Submission Number: 626