Abstract: Multimodal large language models (MLLMs) perform remarkably well on cross-modal tasks, yet their spatial understanding still falls far short of human-level performance, and existing prompt learning methods have not fully unlocked their potential. We therefore propose a fine-grained image-text dual prompt learning framework that enhances the spatial understanding ability of MLLMs. Our method uses three mechanisms, namely object detection, image segmentation, and attention visualization, to provide fine-grained prompts for the input image from complementary perspectives, and employs an LLM-based refined Chain-of-Thought method to transform the textual input into fine-grained prompts. This design strengthens the interaction between the image and text prompts, enabling deeper semantic analysis by MLLMs. We evaluate the proposed method on the BLINK benchmark using two tasks, counting and relative depth judgment, that effectively assess spatial understanding. Experimental results show that MLLMs prompted with our method improve significantly on both tasks, validating the effectiveness of our approach.
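To make the dual-prompt idea concrete, below is a minimal Python sketch of how fine-grained image prompts (from object detection, segmentation, and attention visualization) could be combined with an LLM-refined Chain-of-Thought text prompt before querying an MLLM. All function names, prompt wording, and the `llm`/`mllm` callables are hypothetical placeholders, not the authors' released code.

```python
# Illustrative sketch of a fine-grained image-text dual-prompt pipeline.
# Every component here is a stand-in; real detectors, segmenters, attention
# tools, and model APIs would replace the placeholder functions.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ImagePrompts:
    detections: List[str]       # e.g., boxes/labels from an object detector
    segments: List[str]         # e.g., region descriptions from a segmenter
    attention_notes: List[str]  # e.g., salient regions from attention maps


def build_image_prompts(image_path: str) -> ImagePrompts:
    """Hypothetical stand-ins for the three visual mechanisms:
    object detection, image segmentation, and attention visualization."""
    detections = [f"objects detected in {image_path} with boxes and labels"]
    segments = [f"segmented regions of {image_path} with mask descriptions"]
    attention_notes = [f"high-attention areas of {image_path}"]
    return ImagePrompts(detections, segments, attention_notes)


def refine_question_with_cot(question: str, llm: Callable[[str], str]) -> str:
    """LLM-based refined Chain-of-Thought: decompose the question into
    fine-grained reasoning steps before querying the MLLM."""
    return llm(
        "Rewrite the question as step-by-step sub-questions about object "
        f"identity, location, and spatial relations:\n{question}"
    )


def dual_prompt(image_path: str, question: str,
                llm: Callable[[str], str],
                mllm: Callable[[str, str], str]) -> str:
    """Combine the fine-grained image and text prompts and query the MLLM."""
    img = build_image_prompts(image_path)
    refined_q = refine_question_with_cot(question, llm)
    text_prompt = (
        "Visual cues:\n"
        + "\n".join(img.detections + img.segments + img.attention_notes)
        + f"\n\nReasoning steps:\n{refined_q}\n\nAnswer the question."
    )
    return mllm(image_path, text_prompt)


if __name__ == "__main__":
    # Dummy callables so the sketch runs end to end without real models.
    echo_llm = lambda p: "1) List objects. 2) Compare their depths. 3) Answer."
    echo_mllm = lambda img, p: f"[MLLM answer given a prompt of {len(p)} chars]"
    print(dual_prompt("scene.jpg", "Which object is closer to the camera?",
                      echo_llm, echo_mllm))
```

In this sketch the three visual mechanisms each contribute a textual cue, so the MLLM receives both the raw image and a structured description of it alongside the decomposed question; the actual framework's prompt formats and model interfaces are assumptions here.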
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: prompt learning, multimodal large language models, fine-grained
Languages Studied: English
Submission Number: 1853