Keywords: Active Perception, World Models, Object Localization
TL;DR: We introduce WoMAP (World Models for Active Perception): a recipe for training open-vocabulary object localization policies that are grounded in the physical world.
Abstract: Active object localization remains a critical challenge for robots, requiring efficient
exploration of partially observable environments. However, state-of-the-art robot
policies either struggle to generalize beyond demonstration datasets (e.g., imitation
learning methods) or fail to generate physically grounded actions (e.g., VLMs).
To address these limitations, we introduce WoMAP (World Models for Active
Perception): a recipe for training open-vocabulary object localization policies that:
(i) uses a Gaussian Splatting-based real-to-sim-to-real pipeline for scalable data
generation without the need for expert demonstrations, (ii) distills dense reward
signals from open-vocabulary object detectors, and (iii) leverages a latent world
model for dynamics and reward prediction to ground high-level action proposals
at inference time. Rigorous simulation and hardware experiments demonstrate WoMAP's superior performance in a wide range of zero-shot object localization tasks, with a 63\% success rate compared to a 10\% success rate for a VLM baseline, and only a 10-20\% drop in performance when transferring directly from sim to real.
Spotlight: mp4
Submission Number: 829
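To make component (iii) of the abstract concrete, below is a minimal Python sketch of inference-time grounding with a latent world model: candidate high-level action proposals (e.g., from a VLM) are rolled out in latent space, and the proposal with the highest predicted cumulative reward is selected. All class and function names (`LatentWorldModel`, `ground_proposals`) and the network shapes are illustrative assumptions, not the paper's actual architecture or API.

```python
# Hypothetical sketch of world-model-based grounding of action proposals.
# The real WoMAP architecture and interfaces may differ substantially.
import torch

class LatentWorldModel(torch.nn.Module):
    """Toy latent world model: encoder, dynamics head, and reward head."""
    def __init__(self, obs_dim=64, act_dim=7, latent_dim=32):
        super().__init__()
        self.encoder = torch.nn.Linear(obs_dim, latent_dim)                # o_t -> z_t
        self.dynamics = torch.nn.Linear(latent_dim + act_dim, latent_dim)  # (z_t, a_t) -> z_{t+1}
        self.reward = torch.nn.Linear(latent_dim, 1)                       # z_t -> predicted localization reward

    def rollout_return(self, obs, actions):
        """Predicted cumulative reward of an action sequence, imagined in latent space."""
        z = self.encoder(obs)
        total = torch.zeros(1)
        for a in actions:  # actions: (T, act_dim)
            z = self.dynamics(torch.cat([z, a], dim=-1))
            total = total + self.reward(z)
        return total

def ground_proposals(world_model, obs, proposals):
    """Rank high-level action proposals (e.g., from a VLM) by predicted reward."""
    returns = torch.stack([world_model.rollout_return(obs, p) for p in proposals])
    return proposals[int(torch.argmax(returns))]

# Usage: pick the proposal the world model predicts will best localize the object.
wm = LatentWorldModel()
obs = torch.randn(64)
proposals = [torch.randn(5, 7) for _ in range(3)]  # three candidate action sequences
best = ground_proposals(wm, obs, proposals)
```

The key design point this illustrates is that the world model acts as a physically grounded filter: the VLM supplies semantically plausible proposals, while the learned dynamics and reward heads score them against predicted outcomes rather than executing them blindly.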