Token-Complexity-Based Routing Technique within Mixture-of-Experts Architecture for Large Language Models
Keywords: Mixture of Experts, Large Language Model, Router, Token Complexity Threshold.
TL;DR: A Mixture-of-Experts architecture for Large Language Models that enhances scaling and performance
Abstract: Mixture-of-Experts (MoE) architectures have emerged as a powerful technique for scaling and improving Large Language Models by conditionally activating feed-forward subnetworks within the Transformer layers and distributing tokens through a routing system. However, existing MoE methods often rely on static top-k routing strategies that do not account for token-level variability in complexity, leading to suboptimal expert utilization. In this work, we propose a novel token-complexity-based routing framework that dynamically allocates tokens to either lightweight or strong feed-forward networks (FFNs) based on their estimated complexity. Our router is trained with a few-shot classification objective, together with a surrogate neural network layer, to distinguish easy tokens from complex ones. We evaluate the framework by integrating the router with the Mistral-7B and Llama-2-7B models on benchmarks from several fields; the proposed MoE framework improves accuracy by up to 12% over state-of-the-art MoE architectures at reasonable computational cost.
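The routing idea in the abstract can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: the linear complexity scorer, the fixed threshold, and all parameter names (`w_router`, `W1_light`, etc.) are assumptions standing in for the paper's trained router and its surrogate layer. Tokens whose complexity score falls below the threshold go to a small FFN; the rest go to a larger one.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_light, d_heavy = 16, 8, 64  # hidden sizes chosen for illustration only

# Hypothetical FFN parameters: a lightweight expert and a strong (wider) expert.
W1_light = rng.standard_normal((d_model, d_light)) * 0.1
W2_light = rng.standard_normal((d_light, d_model)) * 0.1
W1_heavy = rng.standard_normal((d_model, d_heavy)) * 0.1
W2_heavy = rng.standard_normal((d_heavy, d_model)) * 0.1

# Hypothetical router: a linear scorer producing a per-token complexity estimate.
# In the paper this would be a trained few-shot classifier, not random weights.
w_router = rng.standard_normal(d_model)
threshold = 0.0  # complexity threshold separating "easy" from "complex" tokens

def ffn(x, W1, W2):
    """Two-layer feed-forward network with ReLU activation."""
    return np.maximum(x @ W1, 0.0) @ W2

def route(tokens):
    """Send easy tokens to the lightweight FFN and complex tokens to the strong FFN."""
    scores = tokens @ w_router            # (n_tokens,) complexity scores
    is_complex = scores > threshold       # boolean routing decision per token
    out = np.empty_like(tokens)
    out[~is_complex] = ffn(tokens[~is_complex], W1_light, W2_light)
    out[is_complex] = ffn(tokens[is_complex], W1_heavy, W2_heavy)
    return out, is_complex

tokens = rng.standard_normal((5, d_model))
out, mask = route(tokens)
```

Unlike static top-k routing, each token here activates exactly one expert, and the compute spent on a token scales with its estimated complexity.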
Primary Area: generative models
Submission Number: 13468