R3M: A Universal Visual Representation for Robot Manipulation

Suraj Nair; Aravind Rajeswaran; Vikash Kumar; Chelsea Finn; Abhinav Gupta

Published: 10 Sept 2022, Last Modified: 27 Apr 2025. CoRL 2022 Poster. Readers: Everyone

Keywords: Visual Representation Learning, Robotic Manipulation

TL;DR: We pre-train a visual representation on diverse human video datasets; it can be downloaded and used off-the-shelf to enable more data-efficient robot learning in simulation and the real world.

Abstract: We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks. Concretely, we pre-train a visual representation on the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty to encourage sparse and compact representations. The resulting representation, R3M, can be used as a frozen perception module for downstream policy learning. Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo. Furthermore, R3M enables a Franka Emika Panda arm to learn a range of manipulation tasks in a real, cluttered apartment given just 20 demonstrations.
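To make two of the three training terms named in the abstract concrete, here is a minimal sketch of a time-contrastive loss combined with an L1 sparsity penalty. It is an illustration under stated assumptions, not the authors' implementation: the similarity measure (negative Euclidean distance), the single-negative softmax form, the function names, and the `l1_weight` value are all hypothetical choices, and the video-language alignment term is omitted entirely. Embeddings are plain Python lists so the example stays self-contained.

```python
import math


def neg_l2(a, b):
    """Similarity as negative Euclidean distance (an assumed choice)."""
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def time_contrastive_loss(z_anchor, z_near, z_far):
    """Cross-entropy that prefers the temporally nearer frame.

    z_anchor, z_near, z_far are embeddings of three frames from one video,
    where z_near is closer in time to z_anchor than z_far is.
    """
    s_pos = neg_l2(z_anchor, z_near)
    s_neg = neg_l2(z_anchor, z_far)
    # Numerically stable log-sum-exp over the two similarity scores.
    m = max(s_pos, s_neg)
    log_denom = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
    return log_denom - s_pos


def l1_penalty(z):
    """Sum of absolute values, which pushes embedding entries toward zero."""
    return sum(abs(x) for x in z)


def combined_loss(z_anchor, z_near, z_far, l1_weight=1e-5):
    """Time-contrastive term plus a weighted L1 term (weight is hypothetical)."""
    sparsity = sum(l1_penalty(z) for z in (z_anchor, z_near, z_far))
    return time_contrastive_loss(z_anchor, z_near, z_far) + l1_weight * sparsity


if __name__ == "__main__":
    anchor, near, far = [0.0, 0.0], [0.0, 1.0], [3.0, 4.0]
    print(combined_loss(anchor, near, far))
```

In this sketch the loss is small when the temporally nearer frame is also nearer in embedding space, and equals log 2 when the two frames are equidistant from the anchor. In the setting the abstract describes, the embeddings would come from the visual encoder being pre-trained, and the encoder would then be frozen for downstream policy learning.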

Student First Author: yes

Website: https://sites.google.com/view/robot-r3m/

Code: https://github.com/facebookresearch/r3m

Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/r3m-a-universal-visual-representation-for/code)
