Abstract: Multimodal large language models (MLLMs) perform remarkably well on cross-modal tasks, yet their spatial understanding still falls far short of human-level performance, and existing prompt learning methods have not fully unlocked their potential. We therefore propose a fine-grained image-text dual prompt learning framework that enhances the spatial understanding ability of MLLMs. Our method uses three mechanisms, namely object detection, image segmentation, and attention visualization, to provide fine-grained prompts for the input image from complementary perspectives, and employs an LLM-based refined Chain-of-Thought method to transform the textual input into fine-grained prompts. This design strengthens the interaction between the image and text prompts, enabling deeper semantic analysis by MLLMs. We evaluate the proposed method on the BLINK benchmark using two tasks, counting and relative depth judgment, that effectively assess spatial understanding. Experimental results show that MLLMs prompted with our method improve significantly on both tasks, validating the effectiveness of our approach.
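To make the dual-prompt idea concrete, below is a minimal Python sketch of how fine-grained image prompts (from object detection, segmentation, and attention visualization) could be combined with an LLM-refined Chain-of-Thought text prompt before querying an MLLM. All function names, prompt wording, and the `llm`/`mllm` callables are hypothetical placeholders, not the authors' released code.

```python
# Illustrative sketch of a fine-grained image-text dual-prompt pipeline.
# Every component here is a stand-in; real detectors, segmenters, attention
# tools, and model APIs would replace the placeholder functions.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ImagePrompts:
    detections: List[str]       # e.g., boxes/labels from an object detector
    segments: List[str]         # e.g., region descriptions from a segmenter
    attention_notes: List[str]  # e.g., salient regions from attention maps


def build_image_prompts(image_path: str) -> ImagePrompts:
    """Hypothetical stand-ins for the three visual mechanisms:
    object detection, image segmentation, and attention visualization."""
    detections = [f"objects detected in {image_path} with boxes and labels"]
    segments = [f"segmented regions of {image_path} with mask descriptions"]
    attention_notes = [f"high-attention areas of {image_path}"]
    return ImagePrompts(detections, segments, attention_notes)


def refine_question_with_cot(question: str, llm: Callable[[str], str]) -> str:
    """LLM-based refined Chain-of-Thought: decompose the question into
    fine-grained reasoning steps before querying the MLLM."""
    return llm(
        "Rewrite the question as step-by-step sub-questions about object "
        f"identity, location, and spatial relations:\n{question}"
    )


def dual_prompt(image_path: str, question: str,
                llm: Callable[[str], str],
                mllm: Callable[[str, str], str]) -> str:
    """Combine the fine-grained image and text prompts and query the MLLM."""
    img = build_image_prompts(image_path)
    refined_q = refine_question_with_cot(question, llm)
    text_prompt = (
        "Visual cues:\n"
        + "\n".join(img.detections + img.segments + img.attention_notes)
        + f"\n\nReasoning steps:\n{refined_q}\n\nAnswer the question."
    )
    return mllm(image_path, text_prompt)


if __name__ == "__main__":
    # Dummy callables so the sketch runs end to end without real models.
    echo_llm = lambda p: "1) List objects. 2) Compare their depths. 3) Answer."
    echo_mllm = lambda img, p: f"[MLLM answer given a prompt of {len(p)} chars]"
    print(dual_prompt("scene.jpg", "Which object is closer to the camera?",
                      echo_llm, echo_mllm))
```

In this sketch the three visual mechanisms each contribute a textual cue, so the MLLM receives both the raw image and a structured description of it alongside the decomposed question; the actual framework's prompt formats and model interfaces are assumptions here.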
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: prompt learning, multimodal large language models, fine-grained
Languages Studied: English
Submission Number: 1853