Abstract: Training vision-based robotic systems from scratch is both computationally expensive and memory intensive. To mitigate these challenges, recent approaches forgo end-to-end training in favor of adopting visual representations from vision foundation models -- large-scale models designed for broad task transferability. Recent years have seen numerous vision foundation models emerge, including several designed specifically for manipulation tasks. However, we still lack clear principles for what makes these models effective for robotics applications. To address this gap, we systematically evaluate vision foundation models to understand what makes them effective for offline robotic learning. Across nine diverse vision encoders, we find that a representation's ability to reconstruct edges and predict keypoints strongly correlates with its performance on manipulation tasks. Extensive correlation analysis across 21 manipulation tasks consistently shows that representations preserving edge and keypoint information achieve the highest environment success rates. These findings appear to challenge conventional wisdom about reconstruction-based pre-training and offer a new lens for understanding what makes vision representations effective for robotics.
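To make the abstract's central analysis concrete, the following is a minimal sketch (not the authors' code) of relating per-encoder probe scores, such as edge-reconstruction quality, to average manipulation success rates via rank correlation. All array values and variable names are illustrative placeholders, not data from the paper.

    import numpy as np
    from scipy.stats import spearmanr

    # Hypothetical edge-reconstruction probe scores for nine encoders (higher = better).
    edge_probe_score = np.array([0.61, 0.55, 0.72, 0.48, 0.66, 0.70, 0.52, 0.68, 0.59])

    # Hypothetical mean success rates for the same encoders, averaged over 21 tasks.
    success_rate = np.array([0.44, 0.38, 0.57, 0.31, 0.49, 0.55, 0.35, 0.52, 0.42])

    # Spearman rank correlation is robust to monotone but nonlinear relationships
    # between probe performance and downstream task success.
    rho, p_value = spearmanr(edge_probe_score, success_rate)
    print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")

A high positive rho under this kind of analysis would indicate that encoders whose representations better preserve edge (or keypoint) information also tend to achieve higher environment success rates.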
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~David_Fouhey2
Submission Number: 4087