Keywords: VLM, RL, reasoning, MLLM
Abstract: Active vision, also known as active perception, refers to actively selecting where and how to look in order to gather task-relevant information. It is a critical component of efficient perception and decision-making in humans and advanced embodied agents. With the rise of Multimodal Large Language Models (MLLMs) as central planners in robotic systems, the lack of methods for equipping MLLMs with active perception has become a key gap. We first provide a systematic definition of MLLM-based active perception tasks and show that GPT-o3's zoom-in strategy can be viewed as a special case, though it suffers from low efficiency and inaccurate region selection. To address these issues, we propose Active-o3, a reinforcement learning framework built on GRPO (Group Relative Policy Optimization) that equips MLLMs with active perception capabilities. Leveraging a modular sensing-action design and a dual-form reward, Active-o3 autonomously learns efficient and stable region-selection strategies without explicit supervision. We further establish a comprehensive benchmark covering both open-world tasks (small/dense-object grounding) and domain-specific scenarios (remote sensing, autonomous driving, interactive segmentation). Experimental results demonstrate that Active-o3 significantly enhances active perception capabilities compared to Qwen2.5-VL-CoT. Moreover, we show that our RL framework not only preserves the model's general understanding ability but can also serve as a proxy task for leveraging perception data, further improving performance on benchmarks such as RealWorldQA. We hope that our work provides a simple codebase and a unified evaluation protocol to facilitate future research on active perception with MLLMs.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2529