Task-Aware Mechanism: Hybrid MoE Vision Tower Towards Holistic Video Understanding

15 Sept 2025 (modified: 26 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Video Understanding; Multimodal Large Language Models; Large Vision-Language Models; Mixture of Experts
TL;DR: We propose the Task-Aware Mechanism (TAM), which employs a Hybrid Gating Strategy to endow the MoE Vision Tower with task awareness. TAM intelligently determines the appropriate task category, the number of frames to sample, and the optimal resolution based on the user's query.
Abstract: Do *comprehending the main idea of a 2-hour movie* and *counting the birds appearing in a 15-second clip* really warrant the same video processing pipeline? Recent successes of Mixture-of-Experts (MoE) architectures in language modeling have inspired explorations of MoE applications. However, existing MoE models mainly focus on Large Language Models (LLMs) while neglecting the Vision Tower (VT) in multimodal models. MoE-LLMs are predominantly designed for capacity scaling, whereas the VT contains three fundamentally distinct modules, indicating that directly transplanting MoE-LLM designs to the VT is unlikely to be effective. Inspired by the emerging Task-Aware idea, we argue that MoE-VT architectures should embody the principle of *the Right Tool for the Right Job*, providing suitable processing for different tasks. To this end, we propose the Task-Aware Mechanism (TAM), an MoE-VT architecture that employs a Hybrid Gating Strategy to endow the VT with intrinsic task-aware ability. To equip the framework with task-aware capabilities, we further introduce a compact Inductor module with only 0.1B parameters, trained on our new dataset TA-116k. With the Inductor, TAM can dynamically determine the appropriate task category, the optimal resolution, and the number of frames to sample, based on the user query and the length of the video. Leveraging TAM, we introduce the TallVA-8B-A7B model, which outperforms current SOTA methods across various benchmarks when built on comparable LLMs, demonstrating that TAM enables video understanding models to become more holistic across diverse tasks.
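To make the Inductor's interface concrete, below is a minimal PyTorch sketch of a controller that maps a query embedding and video length to a task category, a frame budget, and an input resolution. All names, the category labels, the discretized frame/resolution options, and the network shape are illustrative assumptions; the paper's actual Inductor design, label set, and gating strategy are not specified in this abstract.

```python
import torch
import torch.nn as nn

# Hypothetical task categories and candidate sampling configurations
# (assumptions for illustration; not the paper's actual choices).
TASK_CATEGORIES = ["holistic_summary", "fine_grained_counting", "temporal_grounding"]
FRAME_OPTIONS = [16, 64, 256]          # frames to sample
RESOLUTION_OPTIONS = [224, 448, 672]   # input resolution (pixels per side)

class Inductor(nn.Module):
    """Lightweight controller mapping (query embedding, video length)
    to a task category, a frame budget, and an input resolution."""
    def __init__(self, query_dim: int = 768, hidden: int = 512):
        super().__init__()
        # +1 input feature for the log-scaled video duration
        self.backbone = nn.Sequential(
            nn.Linear(query_dim + 1, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
        )
        self.task_head = nn.Linear(hidden, len(TASK_CATEGORIES))
        self.frame_head = nn.Linear(hidden, len(FRAME_OPTIONS))
        self.res_head = nn.Linear(hidden, len(RESOLUTION_OPTIONS))

    def forward(self, query_emb: torch.Tensor, video_len_sec: torch.Tensor):
        # Concatenate query features with log-scaled video duration.
        x = torch.cat([query_emb, torch.log1p(video_len_sec).unsqueeze(-1)], dim=-1)
        h = self.backbone(x)
        return (
            self.task_head(h).argmax(-1),   # task category index
            self.frame_head(h).argmax(-1),  # index into FRAME_OPTIONS
            self.res_head(h).argmax(-1),    # index into RESOLUTION_OPTIONS
        )

# Usage: route a query about a 15-second clip.
inductor = Inductor()
query_emb = torch.randn(1, 768)  # e.g., pooled text-encoder features
task, f_idx, r_idx = inductor(query_emb, torch.tensor([15.0]))
print(TASK_CATEGORIES[task.item()],
      FRAME_OPTIONS[f_idx.item()],
      RESOLUTION_OPTIONS[r_idx.item()])
```

In this reading, the predicted configuration would then gate which VT experts run and at what cost, so a counting query over a short clip gets dense high-resolution frames while a movie-level summary gets sparse low-resolution sampling; whether the paper's Hybrid Gating Strategy works exactly this way is an assumption here.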
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6041