Keywords: vision-language model, decision-making, multimodal reasoning
Abstract: Vision-Language Models (VLMs) show promise in decision-making tasks, but visual hallucination limits their performance in complex visual scenes. Such scenes contain many visual objects, and at each step the model must focus on the few that are essential to the current action while avoiding interference from unrelated ones. In this work, we propose SceneDiver, a coarse-to-fine, two-stage focus plan generation pipeline that tackles the key technical challenge of identifying essential objects in scenes with complicated visual and semantic structures. First, the VLM executes a virtual, coarse-grained plan over the scene graph. Then, it zooms into the local neighborhood around each graph node to perform fine-grained focusing. The resulting focus map modulates the VLM's attention during decision making, steering the model toward task-critical objects and alleviating perceptual hallucination. Experimental results on robotic manipulation and room navigation benchmarks demonstrate that our approach overcomes the perceptual limitations of VLMs while significantly enhancing their decision-making performance and generalization ability. Our code will be released upon acceptance.
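For a concrete picture of the pipeline the abstract describes, below is a minimal, hypothetical Python sketch of the two stages: a coarse pass that selects candidate nodes on a scene graph, then a fine pass that scores each candidate's local neighborhood to produce a focus map. All names here (SceneNode, MockVLM, build_focus_map) and the keyword-overlap scoring are illustrative assumptions, not SceneDiver's actual implementation.

```python
# Illustrative sketch only: a mock of the coarse-to-fine focus pipeline.
# All classes, functions, and scoring rules are assumptions for illustration,
# not the paper's API or method.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SceneNode:
    name: str                                           # object label, e.g. "mug"
    neighbors: List["SceneNode"] = field(default_factory=list)

class MockVLM:
    """Stand-in for a real VLM; scores relevance by naive keyword overlap."""
    def is_relevant(self, obj: str, task: str) -> bool:
        # Stage 1 proxy: coarse filter over individual scene-graph nodes.
        return obj in task

    def score_focus(self, neighborhood: List[str], task: str) -> float:
        # Stage 2 proxy: fine-grained score over a node's local neighborhood.
        return sum(o in task for o in neighborhood) / len(neighborhood)

def build_focus_map(vlm: MockVLM, graph: List[SceneNode], task: str) -> Dict[str, float]:
    # Stage 1: a virtual, coarse-grained plan selects candidate nodes.
    candidates = [n for n in graph if vlm.is_relevant(n.name, task)]
    # Stage 2: zoom into each candidate's neighborhood for fine focusing.
    return {
        n.name: vlm.score_focus([n.name] + [nb.name for nb in n.neighbors], task)
        for n in candidates
    }

if __name__ == "__main__":
    mug, table, lamp = SceneNode("mug"), SceneNode("table"), SceneNode("lamp")
    mug.neighbors, table.neighbors = [table], [mug, lamp]
    task = "pick up the mug from the table"
    # Prints per-object focus weights, e.g. {'mug': 1.0, 'table': 0.67}.
    # In the described system, such a map would bias the VLM's visual
    # attention toward task-critical objects during decision making.
    print(build_focus_map(MockVLM(), [mug, table, lamp], task))
```

In a real system the MockVLM calls would be replaced by prompted VLM queries, and the focus map would re-weight the model's visual attention rather than merely being printed.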
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4289