Real-World Robot Learning with Masked Visual Pre-training

Ilija Radosavovic; Tete Xiao; Stephen James; Pieter Abbeel; Jitendra Malik; Trevor Darrell

Published: 10 Sept 2022, Last Modified: 05 May 2023, CoRL 2022 Oral, Readers: Everyone

Keywords: Self-Supervised Learning, Visual Representations, Robot Learning

Abstract: In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module. Unlike prior work, we show that the pre-trained representations are effective across a range of real-world robotic tasks and embodiments. We find that our encoder consistently outperforms CLIP (up to 75%), supervised ImageNet pre-training (up to 81%), and training from scratch (up to 81%). Finally, we train a 307M parameter vision transformer on a massive collection of 4.5M images from the Internet and egocentric videos, and demonstrate clearly the benefits of scaling visual pre-training for robot learning.
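
The "frozen encoder, learnable control module" split described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `FrozenEncoder`, `ControlHead`, `train`, and `mean_loss` are hypothetical names, a fixed random projection stands in for the pre-trained MAE vision transformer, the images are 4-element vectors, and the squared-error imitation loss on demonstration actions is an assumption made only for illustration. The point it shows is that training updates the control head while the encoder weights stay untouched.

```python
import math
import random


class FrozenEncoder:
    """Hypothetical stand-in for a pre-trained visual encoder (weights never updated)."""

    def __init__(self, in_dim, feat_dim, seed=0):
        rng = random.Random(seed)
        self.weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
                        for _ in range(feat_dim)]

    def __call__(self, image):
        # Fixed random projection followed by tanh; no parameter is ever modified here.
        return [math.tanh(sum(w * x for w, x in zip(row, image)))
                for row in self.weights]


class ControlHead:
    """Learnable linear map from frozen features to an action vector."""

    def __init__(self, feat_dim, action_dim):
        self.weights = [[0.0] * feat_dim for _ in range(action_dim)]

    def __call__(self, features):
        return [sum(w * f for w, f in zip(row, features)) for row in self.weights]

    def sgd_step(self, features, target_action, lr):
        # One gradient step on the squared error (assumed imitation loss).
        pred = self(features)
        for i, (p, t) in enumerate(zip(pred, target_action)):
            err = p - t
            for j, f in enumerate(features):
                self.weights[i][j] -= lr * err * f


def mean_loss(encoder, head, demos):
    total = 0.0
    for image, action in demos:
        pred = head(encoder(image))
        total += sum((p - t) ** 2 for p, t in zip(pred, action))
    return total / len(demos)


def train(encoder, head, demos, epochs=200, lr=0.05):
    for _ in range(epochs):
        for image, action in demos:
            # Only the head receives updates; the encoder is used read-only.
            head.sgd_step(encoder(image), action, lr)


# Toy demonstrations: (flattened "image", target action). Values are made up.
demos = [
    ([0.1, 0.9, 0.3, 0.5], [0.2, -0.1]),
    ([0.8, 0.2, 0.6, 0.1], [-0.3, 0.4]),
    ([0.4, 0.4, 0.9, 0.7], [0.5, 0.1]),
    ([0.7, 0.6, 0.1, 0.3], [-0.2, -0.4]),
]

encoder = FrozenEncoder(in_dim=4, feat_dim=8)
head = ControlHead(feat_dim=8, action_dim=2)
encoder_snapshot = [row[:] for row in encoder.weights]

loss_before = mean_loss(encoder, head, demos)
train(encoder, head, demos)
loss_after = mean_loss(encoder, head, demos)
```

In a real system the encoder would be the pre-trained vision transformer and the head a small neural network, but the division of labor is the same: the representation is fixed after pre-training, and only the controller is fit to the robot task.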

Student First Author: yes

Website: https://tetexiao.com/projects/real-mvp

Code: https://github.com/ir413/mvp
