Keywords: VLM reasoning, navigation, active vision
Abstract: Vision-Language Models (VLMs) face a critical challenge in active vision: determining where to look in 3D environments to answer questions or locate objects. While reinforcement learning (RL) over chains of thought has boosted performance on static-input visual tasks such as passive image understanding, extending this reasoning-based approach to active vision remains an open challenge. We introduce AViRRL (Active Visual Reasoning with Reinforcement Learning), a multi-turn reinforcement learning approach that trains vision-language models to acquire reasoning strategies for active visual exploration. Our approach combines two key components: (1) a tree-search-guided data generation method for warm-starting active vision reasoning strategies in VLMs, and (2) online multi-turn reinforcement learning that optimizes full thought-action trajectories based on task success. On TinyNav, a benchmark for visual search of small objects in realistic 3D environments, our method significantly outperforms both visual navigation prompting methods and single-step reinforcement learning with behavior-cloning rewards. Compared to these baselines, our agent exhibits more effective active vision behaviors, including more efficient exploration and reasoning that stays aligned with the visual state. Our work shows that reinforcement learning on thought-action trajectories enables VLMs to develop active visual reasoning, shifting from passive perception to autonomous, goal-directed 3D exploration.
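To make the abstract's second component concrete, here is a minimal sketch of a multi-turn loop in which an agent alternates thought and action turns and the full trajectory is reinforced by a sparse task-success reward. This is not the paper's implementation; all names (ToyEnv, ToyPolicy, rollout, train) are hypothetical stand-ins, with the environment and policy reduced to random stubs so the sketch runs end to end.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Turn:
    thought: str   # free-form reasoning emitted before acting
    action: str    # movement/look command executed in the environment

@dataclass
class Trajectory:
    turns: list = field(default_factory=list)
    success: bool = False

class ToyEnv:
    """Stand-in for a 3D visual-search environment: the agent 'finds'
    the target with some fixed probability at each step."""
    def reset(self):
        self.t = 0
        return "initial view"
    def step(self, action):
        self.t += 1
        found = random.random() < 0.2
        done = found or self.t >= 8
        return f"view after {action}", done, found

class ToyPolicy:
    """Stand-in for the VLM: emits a (thought, action) pair per turn."""
    ACTIONS = ["move_forward", "turn_left", "turn_right", "look_up"]
    def act(self, obs):
        action = random.choice(self.ACTIONS)
        return f"Given '{obs}', trying {action}.", action
    def update(self, turn, reward, lr):
        pass  # a real implementation would take a gradient step here

def rollout(policy, env, max_turns=8):
    """Roll out one full thought-action trajectory."""
    traj, obs = Trajectory(), env.reset()
    for _ in range(max_turns):
        thought, action = policy.act(obs)
        obs, done, found = env.step(action)
        traj.turns.append(Turn(thought, action))
        if done:
            traj.success = found  # binary task-success signal
            break
    return traj

def train(policy, env, episodes=100):
    for _ in range(episodes):
        traj = rollout(policy, env)
        reward = 1.0 if traj.success else 0.0  # sparse terminal reward
        for turn in traj.turns:                # credit every turn jointly
            policy.update(turn, reward, lr=1e-5)

train(ToyPolicy(), ToyEnv())
```

The key design point the sketch illustrates is that the reward is assigned to the whole trajectory rather than to a single step, which is what distinguishes the multi-turn formulation from the single-step RL baseline mentioned in the abstract.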
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8155