What Makes Certain Pre-Trained Visual Representations Better for Robotic Learning?

Kyle Hsu; Tyler Ga Wei Lum; Ruohan Gao; Shixiang Shane Gu; Jiajun Wu; Chelsea Finn

08 Oct 2022 (modified: 05 May 2023) | Deep RL Workshop 2022 | Readers: Everyone

Keywords: pre-training, robotics, foundation models, vision, imitation learning, representation similarity analysis, intrinsic dimensionality

TL;DR: We empirically analyze the use of visual representations pre-trained on diverse, non-robotic datasets for learning robot manipulation tasks and find several properties that moderately and consistently correlate with task success.

Abstract: Deep learning for robotics is data-intensive, but collecting high-quality robotics data at scale is prohibitively expensive. One way to mitigate this cost is to leverage visual representations pre-trained on relatively abundant non-robotic datasets. Existing works have focused on proposing pre-training strategies and assessing them via ablation studies, which yields only high-level knowledge of how pre-training design choices affect downstream performance. However, the significant gap in data and objective between pre-training and downstream robotic learning motivates a more detailed understanding of which properties of pre-trained visual representations give the better ones their comparative advantage. In this work, we empirically analyze the representations of robotic manipulation data from several standard benchmarks under a variety of pre-trained models, correlating key metrics of the representations with closed-loop task performance after behavior cloning. We find evidence suggesting that our proposed metrics have substantive predictive power for downstream robotic learning.
