Long-Tailed Distribution-Aware Router For Mixture-of-Experts in Large Vision-Language Model

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: long-tailed distribution, vision-language model, mixture-of-experts, modality-specific routing
Abstract: The mixture-of-experts (MoE) architecture, which replaces dense architectures with sparse ones, has garnered attention in large vision-language models (LVLMs) for achieving comparable performance with fewer activated parameters. Existing MoE architectures for LVLMs primarily focus on token-to-expert routing (TER), encouraging different experts to specialize in processing specific tokens. However, these architectures typically rely on a load-balancing mechanism and neglect the inherent distributional differences between the vision and language modalities. To address this, we propose the Long-Tailed Distribution-aware Router (LTDR) for vision-language TER, which tackles two key challenges: (1) Modality-specific distribution-aware routing. We observe that language TER follows a relatively uniform distribution, whereas vision TER exhibits a long-tailed distribution; this discrepancy calls for a distinct routing strategy for each modality. (2) Vision-specific expert activation. Recognizing the importance of high-information vision tail tokens, we introduce an oversampling-like strategy that increases the number of activated experts so that representations of vision tail tokens are learned sufficiently. Extensive experiments on vision-language and vision benchmarks validate the effectiveness of our approach.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 13490
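The abstract describes two mechanisms: routing vision and language tokens under different strategies, and activating more experts for vision tokens as an oversampling-like remedy for long-tailed vision TER. The following is a minimal PyTorch sketch of how such a modality-aware router could look; the class name, the per-modality top-k values (`k_lang`, `k_vis`), and the boolean vision mask are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LongTailAwareRouter(nn.Module):
    """Illustrative modality-aware MoE router (a sketch, not the paper's code).

    Language tokens use a standard sparse top-k (their expert assignments are
    roughly uniform); vision tokens activate more experts (k_vis > k_lang) so
    that rare, high-information "tail" vision tokens get enough expert capacity.
    """

    def __init__(self, d_model: int, n_experts: int,
                 k_lang: int = 2, k_vis: int = 4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.k_lang = k_lang
        self.k_vis = k_vis  # oversampling-like: more activated experts for vision

    def forward(self, x: torch.Tensor, is_vision: torch.Tensor):
        # x: (n_tokens, d_model); is_vision: (n_tokens,) bool mask
        probs = F.softmax(self.gate(x), dim=-1)        # (n_tokens, n_experts)
        k_max = max(self.k_lang, self.k_vis)
        topk_p, topk_idx = probs.topk(k_max, dim=-1)   # (n_tokens, k_max)

        # Per-token budget: vision tokens keep k_vis experts, language k_lang.
        k_per_tok = torch.where(
            is_vision,
            torch.full_like(is_vision, self.k_vis, dtype=torch.long),
            torch.full_like(is_vision, self.k_lang, dtype=torch.long),
        )
        keep = (torch.arange(k_max, device=x.device).expand_as(topk_idx)
                < k_per_tok.unsqueeze(-1))

        # Zero out experts beyond each token's budget, renormalize gate weights.
        topk_p = topk_p * keep
        topk_p = topk_p / topk_p.sum(-1, keepdim=True)
        return topk_idx, topk_p                        # expert ids, mixing weights
```

Giving only vision tokens a larger top-k mimics oversampling: tail vision tokens receive gradient signal through more experts, while language tokens keep the usual sparse activation budget.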