WoMAP: World Models For Embodied Open-Vocabulary Object Localization

Tenny Yin; Zhiting Mei; Tao Sun; Lihan Zha; Miyu Yamane; Emily Zhou; Jeremy Bao; Ola Sho; Anirudha Majumdar

Published: 21 Jun 2025, Last Modified: 21 Jun 2025

Venue: SWOMO RSS25 Oral

License: CC BY 4.0

Keywords: Active Perception, World Models, Object Localization

TL;DR: We introduce WoMAP (World Models for Active Perception): a recipe for training open-vocabulary object localization policies that are grounded in the physical world.

Abstract: Language-instructed active object localization remains a critical challenge for robots, requiring efficient exploration of partially observable environments. However, state-of-the-art approaches either struggle to generalize beyond demonstration datasets (e.g., imitation learning methods) or fail to generate physically grounded actions (e.g., VLMs). To address these limitations, we introduce WoMAP (World Models for Active Perception), a recipe for training open-vocabulary object localization policies that: (i) uses a Gaussian Splatting-based real-to-sim-to-real pipeline for scalable data generation without the need for expert demonstrations, (ii) distills dense reward signals from open-vocabulary object detectors, and (iii) leverages a latent world model for dynamics and reward prediction to ground high-level action proposals at inference time. Rigorous simulation and hardware experiments demonstrate WoMAP's superior performance in a wide range of zero-shot object localization tasks, with more than 7x and 2.5x higher success rates compared to VLM and diffusion policy baselines, respectively. Further, we show that WoMAP achieves strong sim-to-real transfer in experiments on a TidyBot robot.
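
As a rough illustration of point (iii), grounding action proposals with a world model can be thought of as scoring each candidate action by the reward predicted for the latent state it leads to, then executing the best one. The sketch below is not WoMAP's implementation: `select_action`, `toy_dynamics`, `toy_reward`, and `GOAL` are hypothetical names, and the additive dynamics and distance-based reward are toy stand-ins for learned models.

```python
import numpy as np

# Hypothetical target latent; a stand-in for "object is well localized".
GOAL = np.array([1.0, 0.0])


def toy_dynamics(latent, action):
    """Toy latent dynamics (assumption): the action shifts the latent state."""
    return latent + action


def toy_reward(latent):
    """Toy reward (assumption): negative distance to the goal latent."""
    return -float(np.linalg.norm(latent - GOAL))


def select_action(latent, proposals, dynamics, reward):
    """Score each proposed action by its predicted reward and pick the best."""
    scores = [reward(dynamics(latent, action)) for action in proposals]
    best = int(np.argmax(scores))
    return best, float(scores[best])


# Candidate actions, e.g. as might be proposed by a high-level planner.
latent = np.zeros(2)
proposals = [
    np.array([0.0, 1.0]),
    np.array([1.0, 0.0]),
    np.array([-1.0, 0.0]),
]
best, score = select_action(latent, proposals, toy_dynamics, toy_reward)
print(f"chosen proposal index: {best}")
```

In a real system the two callables would be replaced by the learned dynamics and reward heads, and the proposals would come from whatever high-level source is being grounded.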

Submission Number: 9
