Cantor: Inspiring Multimodal Chain-of-Thought of MLLM

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 | MM2024 Poster | CC BY 4.0
Abstract: With the advent of large language models (LLMs) enhanced by the chain-of-thought (CoT) methodology, visual reasoning problems are usually decomposed into manageable sub-tasks and tackled sequentially with various external tools. However, such a paradigm faces the challenge of potential "determining hallucinations" in decision-making, caused by insufficient visual information and by low-level perception tools that fail to provide the abstract summaries necessary for comprehensive reasoning. We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks. This paper delves into the realm of multimodal CoT to solve intricate visual reasoning tasks with multimodal large language models (MLLMs) and their cognitive capability. To this end, we propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture. Cantor first acts as a decision generator and integrates visual inputs to analyze the image and problem, ensuring a closer alignment with the actual context. Furthermore, Cantor leverages the advanced cognitive functions of MLLMs to act as multifaceted experts that derive higher-level information, enhancing the CoT generation process. Our extensive experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance across two complex visual reasoning datasets, without requiring fine-tuning or ground-truth rationales.
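The two-stage flow described in the abstract (a decision step that sees the image, followed by expert roles played by the same MLLM) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the `MLLM` callable type, the `SubTask` structure, the prompt wording, the "expert: instruction" line format, and the `fake_mllm` stub are all hypothetical. Whether the final answering step sees the image again is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for an MLLM call: (prompt, optional image bytes) -> text.
MLLM = Callable[[str, Optional[bytes]], str]


@dataclass
class SubTask:
    expert: str       # illustrative expert role name
    instruction: str  # what that expert is asked to extract


def decide(mllm: MLLM, question: str, image: Optional[bytes]) -> List[SubTask]:
    """Decision step: the model sees both image and question and assigns sub-tasks."""
    prompt = f"Plan sub-tasks as 'expert: instruction' lines.\nQuestion: {question}"
    tasks = []
    for line in mllm(prompt, image).splitlines():
        if ":" in line:  # skip lines that do not follow the assumed format
            expert, instruction = line.split(":", 1)
            tasks.append(SubTask(expert.strip(), instruction.strip()))
    return tasks


def execute(mllm: MLLM, tasks: List[SubTask], image: Optional[bytes]) -> Dict[str, str]:
    """Execution step: the same MLLM plays each expert role in turn."""
    return {t.expert: mllm(f"Act as {t.expert}. {t.instruction}", image) for t in tasks}


def answer(mllm: MLLM, question: str, findings: Dict[str, str],
           image: Optional[bytes]) -> str:
    """Synthesis step: combine expert findings into a final answer."""
    context = "\n".join(f"{name}: {text}" for name, text in findings.items())
    return mllm(f"Question: {question}\nExpert findings:\n{context}\nAnswer:", image)


def cantor_like(mllm: MLLM, question: str, image: Optional[bytes] = None) -> str:
    tasks = decide(mllm, question, image)
    findings = execute(mllm, tasks, image)
    return answer(mllm, question, findings, image)


def fake_mllm(prompt: str, image: Optional[bytes] = None) -> str:
    """Canned responses so the sketch runs without a real model."""
    if prompt.startswith("Plan"):
        return "object counter: count the apples\ncaption writer: describe the scene"
    if prompt.startswith("Act as object counter"):
        return "3 apples"
    if prompt.startswith("Act as"):
        return "a bowl on a table"
    return "3"


if __name__ == "__main__":
    print(cantor_like(fake_mllm, "How many apples are in the bowl?"))
```

Replacing `fake_mllm` with a wrapper around a real multimodal model is the only change needed to try the flow end to end; the parsing in `decide` would likely need to be made more robust for free-form model output.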
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: This work contributes to multimedia and multimodal processing by advancing the field of visual reasoning. Visual reasoning is a multimodal task that requires fine-grained, in-depth visual understanding together with the ability to reason over the accompanying text. This work can: (1) Improve computers' understanding of complex scenes: by deeply analyzing and integrating visual and textual information, it helps computers understand multi-level, multi-dimensional information in images and thereby identify and parse complex scenes more accurately. (2) Provide a new perspective on solving complex problems: through cross-modal information fusion and inference, it offers a new methodology for understanding and processing complex problems that draw on multiple information sources, which can be applied in fields such as intelligent auxiliary systems and strategic planning.
Supplementary Material: zip
Submission Number: 418