Q-MoE: Connector for MLLMs with Text-Driven Routing

Published: 01 Jan 2024 · Last Modified: 15 Feb 2025 · ACM Multimedia 2024 · CC BY-SA 4.0
Abstract: Multimodal Large Language Models (MLLMs) have showcased remarkable advances in handling various vision-language tasks. These models typically consist of a Large Language Model (LLM), a vision encoder, and a connector structure that bridges the modality gap between vision and language. It is challenging for the connector to filter the right visual information for the LLM according to the task at hand. Most previous connectors, such as lightweight projection layers and the Q-Former, treat visual information for diverse tasks uniformly and therefore lack task-specific visual information extraction capabilities. To address this issue, this paper proposes Q-MoE, a query-based connector with a Mixture of Experts (MoE) that extracts task-specific information with text-driven routing. Furthermore, an optimal-path-based training strategy is proposed to find an optimal expert combination. Extensive experiments on two popular open-source LLMs and several vision-language tasks demonstrate the effectiveness of the Q-MoE connector.
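The abstract describes a query-based connector in which a text (instruction) embedding routes among a set of experts that extract visual features for the LLM. The PyTorch sketch below illustrates one plausible form of such a connector under stated assumptions; the class, argument, and tensor names (e.g. `TextDrivenMoEConnector`, `num_experts`, `text_emb`) are illustrative and not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextDrivenMoEConnector(nn.Module):
    """Minimal sketch of a query-based MoE connector with text-driven routing.

    Each expert is a small cross-attention block that lets learnable queries
    attend to frozen vision-encoder features; a router conditioned on the text
    (instruction) embedding mixes the experts' outputs. This is an assumption-
    laden illustration, not the authors' code.
    """

    def __init__(self, num_experts=4, num_queries=32, dim=768, text_dim=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.experts = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            for _ in range(num_experts)
        )
        self.router = nn.Linear(text_dim, num_experts)  # text-driven routing

    def forward(self, vision_feats, text_emb):
        # vision_feats: (B, N, dim) patch features from the vision encoder
        # text_emb:     (B, text_dim) pooled embedding of the task instruction
        B = vision_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Run every expert: each returns (B, num_queries, dim)
        expert_outs = torch.stack(
            [attn(q, vision_feats, vision_feats)[0] for attn in self.experts],
            dim=1,
        )  # (B, num_experts, num_queries, dim)
        gate = F.softmax(self.router(text_emb), dim=-1)  # (B, num_experts)
        # Weighted mixture of expert outputs, passed to the LLM as visual tokens
        return torch.einsum("be,beqd->bqd", gate, expert_outs)
```

In this sketch the gate is a dense softmax mixture for simplicity; the paper's optimal-path training strategy for selecting expert combinations is not reproduced here.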