Grounding Deliberate Reasoning in Multimodal Large Language Models

Published: 01 Jan 2025 · Last Modified: 15 May 2025 · MMM (2) 2025 · CC BY-SA 4.0
Abstract: The rise of Multimodal Large Language Models (MLLMs), renowned for their advanced instruction-following and reasoning capabilities, has significantly propelled the field of visual reasoning. However, due to limitations in their image tokenization, most MLLMs struggle to capture fine details of text and objects in images, especially in high-resolution images. To overcome this limitation, we introduce \(\textsc {P}^2\textsc {G}\), a novel framework for grounding reasoning in MLLMs. \(\textsc {P}^2\textsc {G}\) exploits the tool-usage potential of MLLMs, employing expert agents to ground reasoning on the fly in critical visual and textual elements of an image, thereby enabling deliberate reasoning through multimodal prompting. Additionally, we develop \(\textsc {P}^2\textsc {GB}\), a benchmark designed to evaluate MLLMs' proficiency in understanding inter-object relationships and textual content in challenging high-resolution images. Extensive experiments on visual reasoning tasks demonstrate the superiority of \(\textsc {P}^2\textsc {G}\), which achieves performance comparable to GPT-4V on \(\textsc {P}^2\textsc {GB}\) with a 7B backbone. Our work underscores the potential of grounding reasoning with external agents in MLLMs, presenting a promising alternative to mere model scaling.
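
The abstract only sketches the mechanism at a high level, so the following is a minimal, illustrative Python sketch of the grounding-then-reprompting idea it describes: answer directly when possible, otherwise call expert agents (e.g., OCR and detection) and re-prompt the backbone MLLM with the grounded evidence. All names here (OCRAgent, DetectionAgent, mllm_answer, ground_and_reprompt) are hypothetical placeholders and are not the paper's actual API; the point is only the control flow, which augments prompts rather than retraining the backbone.

```python
# Illustrative sketch only: all classes and functions below are hypothetical
# placeholders, not the paper's implementation.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Evidence:
    """A grounded snippet: an image region plus the text or label recovered from it."""
    region: tuple  # (x1, y1, x2, y2) bounding box in image coordinates
    text: str      # OCR text or object label for that region


class OCRAgent:
    """Placeholder expert agent that would return text spans found in the image."""
    def run(self, image) -> List[Evidence]:
        return []  # a real agent would call an OCR model here


class DetectionAgent:
    """Placeholder expert agent that would return salient object regions."""
    def run(self, image) -> List[Evidence]:
        return []  # a real agent would call an object detector here


def mllm_answer(image, question: str, evidence: Optional[List[Evidence]] = None) -> str:
    """Placeholder for the backbone MLLM; grounded evidence is folded into the prompt."""
    prompt = question
    if evidence:
        facts = "; ".join(f"region {e.region}: {e.text}" for e in evidence)
        prompt = f"{question}\nGrounded evidence: {facts}"
    return f"<answer conditioned on: {prompt!r}>"


def ground_and_reprompt(image, question: str) -> str:
    """Try a direct answer first; if the model signals uncertainty, gather
    grounded evidence from the expert agents and re-prompt with it."""
    draft = mllm_answer(image, question)
    if "uncertain" not in draft:  # stand-in for a real confidence/need-for-grounding check
        return draft
    evidence = OCRAgent().run(image) + DetectionAgent().run(image)
    return mllm_answer(image, question, evidence)
```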