What Do We Learn from a Large-Scale Study of Pre-Trained Visual Representations in Sim and Real Environments?

Sneha Silwal; Karmesh Yadav; Tingfan Wu; Jay Vakil; Arjun Majumdar; Sergio Arnaud; Claire Chen; Vincent-Pierre Berges; Dhruv Batra; Aravind Rajeswaran; Mrinal Kalakrishnan; Franziska Meier; Oleksandr Maksymets

Published: 16 Apr 2024, Last Modified: 02 May 2024

Venue: MoMa WS 2024 Oral

License: CC BY 4.0

Keywords: robotics, sim2real, representation learning, manipulation, navigation

TL;DR: A large empirical study of how well the performance of PVRs in simulation predicts their performance on real robots.

Abstract: We present a large empirical investigation of the use of pre-trained visual representations (PVRs) for training downstream policies that execute real-world tasks. Our study involves five different PVRs, each used to train policies for five distinct manipulation or indoor navigation tasks. We performed this evaluation using three different robots and two different policy learning paradigms. From this effort, we draw three insights: 1) the performance trends of PVRs in simulation are generally indicative of their trends in the real world, 2) the use of PVRs enables a first-of-its-kind result with indoor ImageNav (zero-shot transfer to a held-out scene in the real world), and 3) the benefits from variations in PVRs, primarily data augmentation and fine-tuning, also transfer to real-world performance.

Submission Number: 15
