Active Perception for Autonomous Multi-View Geometrically Consistent Data Collection in Local Environments

Published: 16 May 2026, Last Modified: 16 May 2026 | ASAB 2026 Oral | Everyone | CC BY 4.0
Keywords: Active Vision; Dataset Collection; Local Robot Perception
TL;DR: A robotic system in which the robot uses a single user input to actively curate its own dataset for local perception.
Abstract: Robust deployment of robots in specific environments such as homes or nursing facilities requires reliable annotated data for objects in those environments. Despite the impressive progress in computer vision algorithms with the advent of deep learning and foundation models, there remains a bottleneck in generating high-quality, environment-specific data. In this paper, we formulate data collection as an active perception problem, where the robot purposefully moves to acquire informative observations. We present a system in which a robot follows a hemispherical trajectory to capture multi-view images of a scene. From a single seed annotation, provided via vision-language models using point or language prompts, our method leverages robot kinematics, camera intrinsics, and depth sensing to propagate annotations across views, producing a dense, 3D-consistent multi-view dataset. This dataset is then used to train a lightweight, deployable perception model tailored to the local environment. Across 5.3k images spanning 32 cluttered tabletop scenes and 30 object categories, models trained with our method outperform zero-shot Grounding-DINO + SAM by up to 33.5 mAP@50–95 while running at 58 FPS, demonstrating the effectiveness of purposeful robot motion for collecting reliable perception data.
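Illustrative sketch (not the authors' implementation): the abstract says annotations are propagated across views using robot kinematics, camera intrinsics, and depth sensing. One common way to do this for a single annotated pixel is to back-project it to 3D with its depth, move it into the second camera's frame using the two camera poses, and re-project it. The code below assumes an ideal pinhole camera without lens distortion, metric depth, and 4x4 camera-to-world poses (as would come from forward kinematics plus a hand-eye calibration). All function and variable names are hypothetical.

```python
import numpy as np

def backproject(pixel, depth, K):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    u, v = pixel
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])

def project(point_cam, K):
    """Project a 3D point in the camera frame to pixel coordinates (pinhole model)."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point is behind the camera and cannot be projected")
    return np.array([K[0, 0] * x / z + K[0, 2], K[1, 1] * y / z + K[1, 2]])

def propagate_point(pixel, depth, K, T_world_cam_a, T_world_cam_b):
    """Transfer an annotated pixel from view A to view B.

    T_world_cam_a and T_world_cam_b are assumed to be 4x4 homogeneous
    camera-to-world transforms for the two views.
    """
    p_a = np.append(backproject(pixel, depth, K), 1.0)  # homogeneous point in frame A
    p_world = T_world_cam_a @ p_a                       # frame A -> world
    p_b = np.linalg.inv(T_world_cam_b) @ p_world        # world -> frame B
    return project(p_b[:3], K)
```

For example, with focal lengths of 500 px and a principal point at (320, 240), a seed point at the image center with 1 m depth, seen from a second camera shifted 0.1 m along the world x-axis, lands 50 px to the left of center. Propagating a full mask would apply the same steps to every annotated pixel, and a real system would also need to handle occlusion and depth noise, which this sketch ignores.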
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 26