In this paper, we study the problem of developing an end-to-end robot navigation policy that generalizes to highly diverse outdoor and indoor deployment environments and transfers across multiple robot platforms. Our goal is to train a single end-to-end policy that can navigate over hundreds of meters while learning reasonable conventions, such as not disturbing people, staying on paths, and avoiding collisions, and that retains this behavior across embodiments. Training such an end-to-end policy requires large amounts of diverse data to ensure broad coverage of possible environments. Previous navigation works have relied on centrally collected datasets generated by robotics researchers. While these datasets tend to be high quality, their total size amounts to only dozens of hours: enough to learn simple abilities, but far too little to scale to more capable policies.
Facing this data limitation, we turn our attention to more abundant sources of passive data, including crowd-sourced data from non-expert demonstrators and action-free in-the-wild videos. However, passive data carries noisy or missing action labels, which makes it difficult to train good policies on it directly. To address this issue, we propose a model-based reannotation approach: a trained short-horizon policy relabels the actions in the passive data, and the relabeled actions are then used to train a more challenging long-horizon navigation policy. We evaluate our approach with multiple robots in diverse environments, including human-occupied spaces in six cities, to analyze the capabilities of our method, such as cross-embodiment performance.
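To make the core reannotation step concrete, below is a minimal sketch in Python. The `expert(obs, goal) -> action` interface, the fixed goal horizon, and the 2-D velocity action are illustrative assumptions, not the paper's actual expert model, inputs, or action space.

```python
# A minimal sketch of model-based reannotation: a trained short-horizon
# expert relabels a passive (noisy or action-free) trajectory of frames.
import numpy as np

def reannotate_trajectory(frames, expert, horizon=5):
    """Relabel a passive trajectory (a list of image frames) with actions
    predicted by the short-horizon expert.

    frames:  list of HxWxC uint8 images from crowd-sourced or in-the-wild video
    expert:  callable (obs_frame, goal_frame) -> action vector (assumed interface)
    horizon: how many steps ahead to pick the visual goal for the expert
    """
    relabeled = []
    for t in range(len(frames) - 1):
        goal_t = min(t + horizon, len(frames) - 1)  # near-future frame as goal
        action = expert(frames[t], frames[goal_t])  # model-predicted action label
        relabeled.append((frames[t], action))
    return relabeled

# Toy usage with a stand-in expert that returns a dummy 2-D velocity command.
if __name__ == "__main__":
    dummy_frames = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(20)]
    dummy_expert = lambda obs, goal: np.array([0.5, 0.0])  # (linear, angular)
    labeled = reannotate_trajectory(dummy_frames, dummy_expert)
    print(f"relabeled {len(labeled)} frame/action pairs")
```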
Developing broadly generalizable visual navigation policies for robots is a significant challenge, primarily constrained by the availability of large-scale, diverse training data. While curated datasets collected by researchers offer high quality, their limited size restricts policy generalization. To overcome this, we explore leveraging abundant, passively collected data sources, including large volumes of crowd-sourced teleoperation data and unlabeled YouTube videos, despite their potential for lower quality or missing action labels. We propose Model-Based ReAnnotation (MBRA), a framework that utilizes a learned short-horizon, model-based expert to relabel or generate high-quality actions for these passive datasets. This relabeled data is then distilled into LogoNav, a long-horizon navigation policy conditioned on visual goals or GPS waypoints. We demonstrate that LogoNav, trained using MBRA-processed data, achieves state-of-the-art performance, enabling robust navigation over distances exceeding 300 meters in previously unseen indoor and outdoor environments. Our extensive real-world evaluations, conducted across a fleet of robots (including quadrupeds) in six cities on three continents, validate the policy's ability to generalize and navigate effectively even amidst pedestrians in crowded settings.
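As a rough illustration of the distillation stage, the sketch below behavior-clones MBRA-relabeled actions into a goal-conditioned policy. The MLP head, the 2-D GPS-style waypoint goal, and the MSE loss are simplifying assumptions for exposition; they are not the LogoNav architecture or training objective.

```python
# A hedged sketch of distilling relabeled data into a long-horizon,
# goal-conditioned policy via behavior cloning.
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    def __init__(self, obs_dim=512, goal_dim=2, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, obs_feat, goal):
        # Concatenate image features with the waypoint goal, predict an action.
        return self.net(torch.cat([obs_feat, goal], dim=-1))

def distill(policy, loader, epochs=10, lr=1e-4):
    """Behavior cloning: minimize MSE between the policy's prediction
    and the expert-relabeled action for each (obs, goal) pair."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for obs_feat, goal, expert_action in loader:
            loss = nn.functional.mse_loss(policy(obs_feat, goal), expert_action)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy

# Toy usage with random tensors standing in for encoded frames and waypoints.
if __name__ == "__main__":
    batch = [(torch.randn(32, 512), torch.randn(32, 2), torch.randn(32, 2))]
    trained = distill(GoalConditionedPolicy(), batch, epochs=1)
```

In this simplified setup, swapping the goal between an encoded goal image and a GPS waypoint offset only changes the `goal` input, which mirrors how a single policy can be conditioned on either modality.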