Keywords: Latent Mapping, Mobile Manipulation, Robot Learning, 3D Feature Representation, Vision–Language Model
TL;DR: We present an end-to-end policy learning approach that operates directly on a 3D latent map, enabling robot mobile manipulation with improved spatial and temporal understanding compared to image-based reasoning.
Abstract: This paper investigates whether mobile manipulation policies utilizing a 3D latent map achieve better spatial and temporal understanding than image-based reasoning. We introduce an end-to-end policy learning approach that operates directly on a 3D map of latent features, which (i) extends perception beyond the robot's current field of view and (ii) aggregates observations over time, resolving occlusions and suppressing noise. Our mapping approach incrementally fuses multiview observations into a grid of scene-specific latent features, while a shared pre-trained scene-agnostic decoder enables rapid online adaptation. Our policy utilizes the feature map in two ways: it receives global context, obtained by tokenizing the scene-wide latent features, and local perception, which injects nearby map features into the observed visual embeddings. Experiments on object-picking tasks demonstrate that the map-conditioned policy reasons over the entire scene and successfully completes mobile manipulation tasks in novel layouts where target objects lie outside the robot's field of view.
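To make the map-conditioning concrete, below is a minimal sketch of how a policy might consume the latent map in the two ways the abstract describes: tokenizing scene-wide features for global context and injecting nearby map features into visual embeddings for local perception. All module names, tensor shapes, and dimensions (e.g. `MapConditionedPolicy`, `feat_dim=64`) are hypothetical assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a map-conditioned policy; shapes and module names are assumptions.
import torch
import torch.nn as nn


class MapConditionedPolicy(nn.Module):
    def __init__(self, feat_dim=64, embed_dim=256, n_heads=4, action_dim=7):
        super().__init__()
        # Global context: project each map cell's latent feature into a scene token.
        self.map_tokenizer = nn.Linear(feat_dim, embed_dim)
        # Local perception: project nearby map features so they can be added to visual embeddings.
        self.local_proj = nn.Linear(feat_dim, embed_dim)
        # Cross-attention from locally augmented visual embeddings to scene-wide map tokens.
        self.cross_attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        self.action_head = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, action_dim)
        )

    def forward(self, map_feats, visual_embeds, nearby_feats):
        # map_feats:     (B, N_cells, feat_dim)  scene-wide latent map features
        # visual_embeds: (B, N_patch, embed_dim) embeddings of the current camera view
        # nearby_feats:  (B, N_patch, feat_dim)  map features sampled near each visual patch
        global_tokens = self.map_tokenizer(map_feats)
        local_embeds = visual_embeds + self.local_proj(nearby_feats)  # inject local map info
        fused, _ = self.cross_attn(local_embeds, global_tokens, global_tokens)
        return self.action_head(fused.mean(dim=1))  # one action vector per batch element


if __name__ == "__main__":
    policy = MapConditionedPolicy()
    action = policy(
        torch.randn(1, 512, 64),   # latent map grid cells
        torch.randn(1, 196, 256),  # visual patch embeddings
        torch.randn(1, 196, 64),   # nearby map features per patch
    )
    print(action.shape)  # torch.Size([1, 7])
```

The key design point this sketch illustrates is that the map serves two roles at once: a coarse, scene-wide summary (tokens) and a fine-grained augmentation of what the robot currently sees (feature injection), so the policy can reason about objects outside the current field of view.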
Submission Number: 9