Keywords: vision-language model, decision-making, multimodal reasoning
Abstract: Vision-Language Models (VLMs) show promise in decision-making tasks, but visual hallucination limits their performance in complex visual scenes. Such scenes contain many visual objects, and at each step the model must focus on the few that are essential to the current action while avoiding interference from unrelated ones. In this work, we propose SceneDiver, a coarse-to-fine, two-stage focus plan generation pipeline that tackles the key technical challenge of identifying essential objects in scenes with complicated visual and semantic structures. First, the VLM executes a virtual, coarse-grained plan over the scene graph. Then, it zooms into the local neighborhood around each graph node to perform fine-grained focusing. The resulting focus map modulates the VLM's attention during decision making, steering the model toward task-critical objects and alleviating perceptual hallucination. Experimental results on robotic manipulation and room navigation benchmarks demonstrate that our approach overcomes the perceptual limitations of VLMs while significantly enhancing their decision-making performance and generalization ability. Our code will be released upon acceptance.
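For a concrete picture of the pipeline the abstract describes, below is a minimal, hypothetical Python sketch of the two stages: a coarse pass that selects candidate nodes on a scene graph, then a fine pass that scores each candidate's local neighborhood to produce a focus map. All names here (SceneNode, MockVLM, build_focus_map) and the keyword-overlap scoring are illustrative assumptions, not SceneDiver's actual implementation.

```python
# Illustrative sketch only: a mock of the coarse-to-fine focus pipeline.
# All classes, functions, and scoring rules are assumptions for illustration,
# not the paper's API or method.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SceneNode:
    name: str                                           # object label, e.g. "mug"
    neighbors: List["SceneNode"] = field(default_factory=list)

class MockVLM:
    """Stand-in for a real VLM; scores relevance by naive keyword overlap."""
    def is_relevant(self, obj: str, task: str) -> bool:
        # Stage 1 proxy: coarse filter over individual scene-graph nodes.
        return obj in task

    def score_focus(self, neighborhood: List[str], task: str) -> float:
        # Stage 2 proxy: fine-grained score over a node's local neighborhood.
        return sum(o in task for o in neighborhood) / len(neighborhood)

def build_focus_map(vlm: MockVLM, graph: List[SceneNode], task: str) -> Dict[str, float]:
    # Stage 1: a virtual, coarse-grained plan selects candidate nodes.
    candidates = [n for n in graph if vlm.is_relevant(n.name, task)]
    # Stage 2: zoom into each candidate's neighborhood for fine focusing.
    return {
        n.name: vlm.score_focus([n.name] + [nb.name for nb in n.neighbors], task)
        for n in candidates
    }

if __name__ == "__main__":
    mug, table, lamp = SceneNode("mug"), SceneNode("table"), SceneNode("lamp")
    mug.neighbors, table.neighbors = [table], [mug, lamp]
    task = "pick up the mug from the table"
    # Prints per-object focus weights, e.g. {'mug': 1.0, 'table': 0.67}.
    # In the described system, such a map would bias the VLM's visual
    # attention toward task-critical objects during decision making.
    print(build_focus_map(MockVLM(), [mug, table, lamp], task))
```

In a real system the MockVLM calls would be replaced by prompted VLM queries, and the focus map would re-weight the model's visual attention rather than merely being printed.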
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4289