DARE: Difficulty-Aware Dynamic Routing for Mixture of Experts

17 Sept 2025 (modified: 05 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: MLLM, MoE, Difficulty
TL;DR: We propose DARE, a difficulty-aware routing strategy for Sparse MoE models that dynamically allocates experts based on token complexity, improving both performance and efficiency in vision-language tasks.
Abstract: Sparse Mixture-of-Experts (MoE) architectures have become a foundational approach for efficiently scaling Large Vision-Language Models (LVLMs), as they activate only a subset of parameters for each input. However, the commonly adopted Top-K routing strategy assigns a fixed number of experts to every token, ignoring the natural variation in token complexity. This static allocation often results in suboptimal resource utilization, where simple tokens receive excessive computation and complex tokens are insufficiently processed. While recent dynamic routing methods attempt to address this limitation, they lack principled mechanisms to explicitly guide expert allocation based on token-level difficulty, resulting in suboptimal performance in practice. In this paper, we propose \textbf{D}ifficulty-\textbf{A}ware Dynamic \textbf{R}outing for Mixture of \textbf{E}xperts (\textbf{DARE}), a novel routing strategy that adapts expert selection according to the complexity of each token. DARE introduces a lightweight predictor that estimates the difficulty of individual tokens, using their log-perplexity as a theoretically grounded proxy, and employs a set of learnable thresholds to dynamically determine the appropriate number of experts to activate. This mechanism enables fine-grained and adaptive allocation of computational resources, allowing the model to devote more capacity to challenging tokens while conserving resources on easier ones. Extensive experiments on standard vision-language benchmarks demonstrate that DARE consistently outperforms both fixed Top-K routing and existing adaptive routing strategies. It achieves superior task performance while simultaneously improving computational efficiency, highlighting the effectiveness and generality of difficulty-aware routing in sparse MoE architectures for large-scale multimodal models.
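The mechanism described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the predictor here is a plain linear head standing in for the paper's log-perplexity proxy, and the names (`DifficultyAwareRouter`, `k_min`, `k_max`) are assumptions for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifficultyAwareRouter(nn.Module):
    """Hedged sketch of difficulty-aware MoE routing (not the paper's code).

    A lightweight head estimates each token's difficulty (a stand-in for
    the log-perplexity proxy); learnable thresholds map that difficulty
    to a per-token expert count between k_min and k_max.
    """

    def __init__(self, d_model: int, n_experts: int, k_min: int = 1, k_max: int = 4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)      # standard routing logits
        self.difficulty = nn.Linear(d_model, 1)        # lightweight difficulty predictor
        # one learnable threshold per extra expert beyond k_min (assumed init)
        self.thresholds = nn.Parameter(torch.linspace(0.25, 0.75, k_max - k_min))
        self.k_min, self.k_max = k_min, k_max

    def forward(self, x: torch.Tensor):                # x: (tokens, d_model)
        logits = self.gate(x)                          # (tokens, n_experts)
        diff = torch.sigmoid(self.difficulty(x)).squeeze(-1)  # (tokens,) in [0, 1]
        # expert count = k_min + number of thresholds the difficulty exceeds
        k = self.k_min + (diff.unsqueeze(-1) > self.thresholds).sum(-1)  # (tokens,)
        # variable-k top-k mask: rank experts per token, keep ranks below k
        order = logits.argsort(dim=-1, descending=True)
        ranks = order.argsort(dim=-1)                  # rank of each expert per token
        mask = ranks < k.unsqueeze(-1)                 # (tokens, n_experts) bool
        masked = torch.where(mask, logits, torch.full_like(logits, float("-inf")))
        return F.softmax(masked, dim=-1), k            # routing weights, per-token k
```

Easy tokens (low predicted difficulty) clear none of the thresholds and receive `k_min` experts, while hard tokens clear all of them and receive `k_max`, which is the adaptive-allocation behavior the abstract describes.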
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 8748