SPOC: Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World

Published: 01 Jan 2024 · Last Modified: 18 May 2025 · CVPR 2024 · CC BY-SA 4.0
Abstract: Reinforcement learning (RL) with dense rewards and imitation learning (IL) with human-generated trajectories are the most widely used approaches for training modern embodied agents. RL requires extensive reward shaping and auxiliary losses and is often too slow and ineffective for long-horizon tasks. While IL with human supervision is effective, collecting human trajectories at scale is extremely expensive. In this work, we show that imitating shortest-path planners in simulation produces agents that, given a language instruction, can proficiently navigate, explore, and manipulate objects in both simulation and in the real world using only RGB sensors (no depth maps or GPS coordinates). This surprising result is enabled by our end-to-end, transformer-based SPOC architecture, powerful visual encoders paired with extensive image augmentation, and the dramatic scale and diversity of our training data: millions of frames of shortest-path-expert trajectories collected inside approximately 200,000 procedurally generated houses containing 40,000 unique 3D assets. Our models, data, training code, and newly proposed 10-task benchmarking suite CHORES are available at spoc-robot.github.io.
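The authors' actual training code lives at spoc-robot.github.io; this page includes no snippets. As a rough, non-authoritative illustration of the core idea, the sketch below shows behavior cloning against shortest-path expert actions using a small causal transformer over RGB frames. All module names, shapes, and hyperparameters are illustrative, and the language-instruction conditioning, pretrained visual backbone, and image augmentation described in the abstract are omitted for brevity.

```python
import torch
import torch.nn as nn

class BCPolicy(nn.Module):
    """Toy transformer policy: encodes each RGB frame, attends causally
    over the observation history, and predicts a discrete action per step.
    A stand-in for SPOC's architecture, not a reproduction of it."""
    def __init__(self, num_actions: int, d_model: int = 256):
        super().__init__()
        # Minimal visual encoder; the paper uses a powerful pretrained
        # encoder with heavy image augmentation, omitted here.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_model),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_actions)

    def forward(self, frames):  # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        # Causal mask: each step attends only to past observations.
        mask = nn.Transformer.generate_square_subsequent_mask(t)
        return self.head(self.temporal(feats, mask=mask))

# One behavior-cloning step on a (synthetic) batch of expert trajectories:
# the labels stand in for actions produced by a shortest-path planner.
policy = BCPolicy(num_actions=20)
opt = torch.optim.AdamW(policy.parameters(), lr=3e-4)
frames = torch.randn(4, 16, 3, 224, 224)          # RGB observations
expert_actions = torch.randint(0, 20, (4, 16))    # shortest-path labels
logits = policy(frames)
loss = nn.functional.cross_entropy(logits.flatten(0, 1),
                                   expert_actions.flatten())
opt.zero_grad(); loss.backward(); opt.step()
```

The key point the abstract makes is that the supervision signal comes from a privileged shortest-path planner in simulation rather than from dense rewards or human teleoperation, so the loss above is plain supervised cross-entropy over expert actions.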