Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable advances in handling various vision-language tasks. These models typically consist of a Large Language Model (LLM), a vision encoder, and a connector that bridges the modality gap between vision and language. It is challenging for the connector to filter the right visual information for the LLM according to the task at hand. Most previous connectors, such as lightweight projections and the Q-Former, treat visual information for diverse tasks uniformly and therefore lack task-specific visual information extraction capabilities. To address this issue, this paper proposes Q-MoE, a query-based connector with a Mixture-of-Experts (MoE) that extracts task-specific information via text-driven routing. Furthermore, an optimal-path-based training strategy is proposed to find an optimal expert combination. Extensive experiments with two popular open-source LLMs on several vision-language tasks demonstrate the effectiveness of the Q-MoE connector. We will release our code upon publication.
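To illustrate the idea of a query-based connector with text-driven expert routing, the following is a minimal sketch, not the authors' implementation: the module name, dimensions, soft (dense) routing, and the use of pooled text embeddings for the router are all assumptions made for illustration.

```python
# Hypothetical sketch of a query-based MoE connector with text-driven routing
# (illustrative only; not the paper's actual Q-MoE code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class QMoEConnectorSketch(nn.Module):
    def __init__(self, vis_dim=1024, txt_dim=768, llm_dim=4096,
                 num_queries=32, num_experts=4, num_heads=8):
        super().__init__()
        # Learnable query tokens, as in Q-Former-style connectors.
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim) * 0.02)
        # Queries cross-attend to the frozen vision encoder's patch features.
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        # Text-driven router: pooled instruction embedding -> soft expert weights.
        self.router = nn.Linear(txt_dim, num_experts)
        # Experts: independent MLPs projecting into the LLM embedding space.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(),
                          nn.Linear(llm_dim, llm_dim))
            for _ in range(num_experts)
        ])

    def forward(self, vision_feats, text_emb):
        # vision_feats: (B, N_patches, vis_dim); text_emb: (B, txt_dim)
        B = vision_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        q, _ = self.cross_attn(q, vision_feats, vision_feats)      # (B, Q, vis_dim)
        gate = F.softmax(self.router(text_emb), dim=-1)            # (B, E)
        expert_out = torch.stack([e(q) for e in self.experts], 1)  # (B, E, Q, llm_dim)
        # Mixture of expert outputs, weighted by the text-driven routing scores.
        return torch.einsum('be,beqd->bqd', gate, expert_out)      # (B, Q, llm_dim)


if __name__ == "__main__":
    connector = QMoEConnectorSketch()
    vis = torch.randn(2, 256, 1024)   # e.g. ViT patch features
    txt = torch.randn(2, 768)         # pooled instruction embedding
    print(connector(vis, txt).shape)  # torch.Size([2, 32, 4096])
```

In this sketch, the routing weights depend only on the text embedding, so different instructions select different expert mixtures over the same visual queries; the resulting query tokens are then fed to the LLM as visual prompts.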
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: This work focuses on the connector, which bridges the modality gap between text and vision in Multimodal Large Language Models (MLLMs). We propose an innovative structure, Q-MoE, a query-based connector with a Mixture-of-Experts (MoE) that extracts task-specific information via text-driven routing, enabling finer-grained, specialized visual information processing. Moreover, Q-MoE uses an optimal-path-based training strategy to find the optimal expert combination. Experimental results on several vision-language tasks demonstrate the effectiveness of the Q-MoE connector.
Supplementary Material: zip
Submission Number: 3615