Keywords: 3D scene understanding, end-to-end learning based navigation, zero-shot generalization
TL;DR: A point-cloud-based, end-to-end navigation transformer that captures key 3D environmental features and generalizes across diverse robots in real-world settings.
Abstract: Humans can navigate and explore unfamiliar environments by leveraging prior experience to decide where to go and how to get there. This remarkable capability requires a geometric understanding of the surrounding scene. To endow robots with a similar capability, we introduce Point Navigation Transformer (PointNT), a foundational navigation and local exploration policy that uses raw 3D point cloud streams to propose plausible exploration targets along with waypoint trajectories to reach those targets. We replace PointNet’s computationally expensive invariance layer with a lightweight encoder designed to capture egocentric, large-scale environmental context. This change reduces the total parameter count by 5.7x (0.97M to 0.17M) and lowers the average prediction error by 55% (3.71m to 1.68m). We further introduce an SE(2) matching loss to enhance spatial consistency. Compared to image-based approaches, our method demonstrates a richer understanding of geometric semantics, effectively distinguishing between similar and dissimilar scenes even without rich color information. We validate our approach through extensive indoor and outdoor experiments across previously unseen environments, including mountainous terrain, dense forests, sandy beaches, and an underground tunnel, using five different platforms in both simulation and the real world, all with a single policy applied in a zero-shot manner.
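The abstract mentions an SE(2) matching loss for spatial consistency but does not define its form. Below is a minimal, hypothetical sketch of one plausible instantiation: comparing predicted and ground-truth waypoints as SE(2) poses (x, y, yaw), with translation penalized by Euclidean distance and rotation by the wrapped angular difference. The parameterization, weighting, and function name are assumptions for illustration, not the paper's actual loss.

```python
# Hypothetical sketch of an SE(2) waypoint matching loss (assumed form; the
# paper does not specify its exact definition or weighting).
import torch


def se2_matching_loss(pred: torch.Tensor,
                      target: torch.Tensor,
                      rot_weight: float = 1.0) -> torch.Tensor:
    """Compare waypoint poses in SE(2).

    pred, target: (..., 3) tensors holding (x, y, yaw) per waypoint.
    Translation error is an L2 distance; rotation error uses the wrapped
    angular difference so that yaw = -pi and yaw = pi count as identical.
    """
    trans_err = torch.linalg.norm(pred[..., :2] - target[..., :2], dim=-1)
    dyaw = pred[..., 2] - target[..., 2]
    # Wrap the yaw difference into (-pi, pi] before penalizing it.
    rot_err = torch.atan2(torch.sin(dyaw), torch.cos(dyaw)).abs()
    return (trans_err + rot_weight * rot_err).mean()


# Example: a batch of 8 predicted trajectories, each with 6 waypoints.
pred = torch.randn(8, 6, 3, requires_grad=True)
target = torch.randn(8, 6, 3)
loss = se2_matching_loss(pred, target)
loss.backward()
```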
Submission Number: 2