Keywords: Active Perception, World Models, Object Localization
TL;DR: We introduce WoMAP (World Models for Active Perception): a recipe for training open-vocabulary object localization policies that are grounded in the physical world.
Abstract: Language-instructed active object localization is a critical challenge for robots, requiring efficient exploration of partially observable environments. However, state-of-the-art approaches either struggle to generalize beyond demonstration datasets (e.g., imitation learning methods) or fail to generate physically grounded actions (e.g., VLMs). To address these limitations, we introduce WoMAP (World Models for Active Perception): a recipe for training open-vocabulary object localization policies that (i) uses a Gaussian Splatting-based real-to-sim-to-real pipeline for scalable data generation without the need for expert demonstrations, (ii) distills dense reward signals from open-vocabulary object detectors, and (iii) leverages a latent world model for dynamics and reward prediction to ground high-level action proposals at inference time. Rigorous simulation and hardware experiments demonstrate WoMAP's superior performance on a wide range of zero-shot object localization tasks, with more than 7x and 2.5x higher success rates than VLM and diffusion policy baselines, respectively. Further, we show that WoMAP achieves strong sim-to-real transfer in experiments on a TidyBot robot.
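As a minimal sketch of the inference-time grounding step the abstract describes, the snippet below rolls out each high-level action proposal through a latent dynamics model and scores it with a learned reward head. All module names, network shapes, and the scoring loop are illustrative assumptions, not the authors' actual implementation.

```python
import torch

class LatentWorldModel(torch.nn.Module):
    """Hypothetical latent world model: dynamics + reward heads."""

    def __init__(self, latent_dim=256, action_dim=7):
        super().__init__()
        # Latent dynamics: predict the next latent state from (state, action).
        self.dynamics = torch.nn.Sequential(
            torch.nn.Linear(latent_dim + action_dim, 512),
            torch.nn.ReLU(),
            torch.nn.Linear(512, latent_dim),
        )
        # Reward head, standing in for the dense localization reward
        # distilled from an open-vocabulary object detector.
        self.reward = torch.nn.Sequential(
            torch.nn.Linear(latent_dim, 128),
            torch.nn.ReLU(),
            torch.nn.Linear(128, 1),
        )

    def score(self, z, actions):
        """Roll out one action sequence from latent state z and
        return the cumulative predicted reward."""
        total = torch.zeros(1)
        for a in actions:
            z = self.dynamics(torch.cat([z, a], dim=-1))
            total = total + self.reward(z)
        return total


def ground_proposals(model, z0, proposals):
    """Select the high-level action proposal (e.g., from a VLM)
    with the highest predicted cumulative reward."""
    scores = [model.score(z0, p).item() for p in proposals]
    return max(range(len(proposals)), key=lambda i: scores[i])


if __name__ == "__main__":
    model = LatentWorldModel()
    z0 = torch.randn(1, 256)  # encoded current observation (assumed)
    # Three candidate plans, each a short sequence of actions.
    proposals = [[torch.randn(1, 7) for _ in range(3)] for _ in range(3)]
    print(f"selected proposal: {ground_proposals(model, z0, proposals)}")
```

The design point this illustrates is that the world model, rather than the proposal generator, supplies physical grounding: proposals are filtered by predicted dynamics and reward before execution.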
Submission Number: 8