Keywords: Large Multimodal Models; Multimodal Chain-of-Thought; In-Context Learning
Verify Author List: I have double-checked the author list and understand that additions and removals will not be allowed after the submission deadline.
TL;DR: This paper proposes a plug-and-play Dual Chain-of-Thought strategy, a novel pipeline that combines visual and textual guidance to improve the performance of LMMs in complex multimodal tasks.
Abstract: Inference augmentation techniques such as Chain-of-Thought (CoT) have already made their mark in Large Language Models (LLMs). However, transferring these advances to Large Multimodal Models (LMMs) presents greater challenges. Drawing inspiration from human cognitive processes, this paper proposes a plug-and-play Dual Chain-of-Thought (DCoT) strategy, a novel pipeline that combines visual and textual guidance to improve the performance of LMMs on complex multimodal tasks. DCoT employs a dual guidance mechanism: on the visual side, bounding box markers direct the model's attention to the image regions relevant to the query, achieving fine-grained image guidance; on the textual side, we propose a Fast In-Context Retrieval Framework (FICRF) that dynamically and automatically retrieves the most suitable examples from a well-built cluster of demonstration examples to serve as context guidance for the current problem. This bimodal approach, leveraging both visual and textual guidance, enhances the inference capabilities of LMMs. Extensive experiments on different LMMs and benchmark datasets validate its effectiveness, opening a new path in multimodal inference. The results showcase how the synergistic combination of visual and textual instructions can raise the performance of these models to new heights, and demonstrate the potential of Chain-of-Thought and In-Context Learning as a superior alternative to fine-tuning LMMs.
A Signed Permission To Publish Form In Pdf: pdf
Supplementary Material: pdf
Primary Area: Deep Learning (architectures, deep reinforcement learning, generative models, deep learning theory, etc.)
Paper Checklist Guidelines: I certify that all co-authors of this work have read and commit to adhering to the guidelines in Call for Papers.
Student Author: No
Submission Number: 313