Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable advances in handling various vision-language tasks. These models typically consist of a Large Language Model (LLM), a vision encoder, and a connector that bridges the modality gap between vision and language. It is challenging for the connector to filter the right visual information for the LLM according to the task at hand. Most previous connectors, such as lightweight projections and the Q-Former, treat visual information for diverse tasks uniformly and therefore lack task-specific visual information extraction capabilities. To address this issue, this paper proposes Q-MoE, a query-based connector with a Mixture-of-Experts (MoE) that extracts task-specific information via text-driven routing. Furthermore, an optimal-path-based training strategy is proposed to find an optimal expert combination. Extensive experiments with two popular open-source LLMs on several vision-language tasks demonstrate the effectiveness of the Q-MoE connector. We will release our code upon publication.
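To illustrate the idea of a query-based connector with text-driven expert routing, the following is a minimal sketch, not the authors' implementation: the module name, dimensions, soft (dense) routing, and the use of pooled text embeddings for the router are all assumptions made for illustration.

```python
# Hypothetical sketch of a query-based MoE connector with text-driven routing
# (illustrative only; not the paper's actual Q-MoE code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class QMoEConnectorSketch(nn.Module):
    def __init__(self, vis_dim=1024, txt_dim=768, llm_dim=4096,
                 num_queries=32, num_experts=4, num_heads=8):
        super().__init__()
        # Learnable query tokens, as in Q-Former-style connectors.
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim) * 0.02)
        # Queries cross-attend to the frozen vision encoder's patch features.
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        # Text-driven router: pooled instruction embedding -> soft expert weights.
        self.router = nn.Linear(txt_dim, num_experts)
        # Experts: independent MLPs projecting into the LLM embedding space.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(),
                          nn.Linear(llm_dim, llm_dim))
            for _ in range(num_experts)
        ])

    def forward(self, vision_feats, text_emb):
        # vision_feats: (B, N_patches, vis_dim); text_emb: (B, txt_dim)
        B = vision_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        q, _ = self.cross_attn(q, vision_feats, vision_feats)      # (B, Q, vis_dim)
        gate = F.softmax(self.router(text_emb), dim=-1)            # (B, E)
        expert_out = torch.stack([e(q) for e in self.experts], 1)  # (B, E, Q, llm_dim)
        # Mixture of expert outputs, weighted by the text-driven routing scores.
        return torch.einsum('be,beqd->bqd', gate, expert_out)      # (B, Q, llm_dim)


if __name__ == "__main__":
    connector = QMoEConnectorSketch()
    vis = torch.randn(2, 256, 1024)   # e.g. ViT patch features
    txt = torch.randn(2, 768)         # pooled instruction embedding
    print(connector(vis, txt).shape)  # torch.Size([2, 32, 4096])
```

In this sketch, the routing weights depend only on the text embedding, so different instructions select different expert mixtures over the same visual queries; the resulting query tokens are then fed to the LLM as visual prompts.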
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: This work focuses on the connector, which bridges the modality gap between text and vision in Multimodal Large Language Models (MLLMs). We propose an innovative structure, Q-MoE, a query-based connector with a Mixture-of-Experts (MoE) that extracts task-specific information via text-driven routing, enabling finer-grained, specialized visual information processing. Moreover, Q-MoE uses an optimal-path-based training strategy to find the optimal expert combination. Experimental results on several vision-language tasks demonstrate the effectiveness of the Q-MoE connector.
Supplementary Material: zip
Submission Number: 3615