4D Latent Mapping for Mobile Manipulation Policy Learning

Published: 27 May 2026, Last Modified: 27 May 2026 | ICRA 2026 SRRA Workshop LightningTalkPoster | Everyone | CC BY 4.0
Keywords: Latent Mapping, Mobile Manipulation, Vision-Language-Action Model, Dynamic Scene Representation
TL;DR: We present a VLA policy conditioned on a 4D latent map, enabling mobile manipulation with coherent spatiotemporal reasoning about the environment and the robot.
Abstract: Long-horizon mobile manipulation requires a robot to reason continually about its own location, environmental changes, and task progress, all of which are challenging to infer from image observations alone. We show that conditioning a mobile manipulation policy on a 4D latent map improves spatiotemporal reasoning over long horizons. The 4D latent map represents both the environment and the articulated robot body as neural points in a shared latent space. At each time step, we update the map online from egocentric observations and proprioceptive state. As objects move, we construct 3D keypoint correspondences between consecutive frames and estimate per-instance rigid-body transforms to reposition the corresponding environment points. Robot points are updated from the proprioceptive state via forward kinematics. To condition a vision-language-action model on the map, we tokenize the map across multiple coordinate frames and spatial resolutions, providing the policy with both local and global context. Experiments on long-horizon mobile manipulation tasks in BEHAVIOR-1K show that the 4D map-conditioned policy outperforms image-only baselines by supporting both allocentric and egocentric spatial reasoning.
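Illustrative sketch (not from the paper): the abstract does not say which estimator is used to recover the per-instance rigid-body transforms. One common choice, assumed here, is the least-squares Kabsch (SVD) solution over matched 3D keypoints. The function names `estimate_rigid_transform` and `move_instance_points` are hypothetical, and the sketch ignores outlier rejection and the latent features attached to each point.

```python
import numpy as np

def estimate_rigid_transform(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~= src @ R.T + t.

    src, dst: (N, 3) arrays of matched 3D keypoints for one object instance
    in two consecutive frames. Assumes N >= 3 non-collinear correspondences.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 3x3 cross-covariance of the centered keypoint sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def move_instance_points(points, R, t):
    """Apply the estimated transform to all map points of the instance."""
    return np.asarray(points, dtype=float) @ R.T + t

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    keypoints_prev = rng.normal(size=(8, 3))
    # Hypothetical ground-truth motion: rotate about z, then translate.
    theta = np.deg2rad(30.0)
    R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                       [np.sin(theta),  np.cos(theta), 0.0],
                       [0.0,            0.0,           1.0]])
    t_true = np.array([0.2, -0.1, 0.05])
    keypoints_curr = keypoints_prev @ R_true.T + t_true

    R, t = estimate_rigid_transform(keypoints_prev, keypoints_curr)
    print("rotation recovered:", np.allclose(R, R_true))
    print("translation recovered:", np.allclose(t, t_true))
```

In this sketch only the sparse keypoints drive the estimate, and the resulting transform is then applied to every map point that belongs to the same instance. Whether the paper's pipeline works this way is an assumption.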
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 5