Abstract: Training vision-based robotic systems from scratch is both computationally expensive and memory intensive. To mitigate these challenges, recent approaches forgo end-to-end training in favor of adopting visual representations from vision foundation models -- large-scale models designed for broad task transferability. Recent years have seen numerous vision foundation models emerge, including several designed specifically for manipulation tasks. However, we still lack clear principles for what makes these models effective for robotics applications. To address this gap, we systematically evaluate vision foundation models to understand what makes them effective for offline robotic learning. Across nine diverse vision encoders, we find that a representation's ability to reconstruct edges and predict keypoints strongly correlates with its performance on manipulation tasks. Extensive correlation analysis across 21 manipulation tasks consistently shows that representations preserving edge and keypoint information achieve the highest environment success rates. These findings appear to challenge conventional wisdom about reconstruction-based pre-training and offer a new lens for understanding what makes vision representations effective for robotics.
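To make the abstract's central analysis concrete, the following is a minimal sketch (not the authors' code) of relating per-encoder probe scores, such as edge-reconstruction quality, to average manipulation success rates via rank correlation. All array values and variable names are illustrative placeholders, not data from the paper.

    import numpy as np
    from scipy.stats import spearmanr

    # Hypothetical edge-reconstruction probe scores for nine encoders (higher = better).
    edge_probe_score = np.array([0.61, 0.55, 0.72, 0.48, 0.66, 0.70, 0.52, 0.68, 0.59])

    # Hypothetical mean success rates for the same encoders, averaged over 21 tasks.
    success_rate = np.array([0.44, 0.38, 0.57, 0.31, 0.49, 0.55, 0.35, 0.52, 0.42])

    # Spearman rank correlation is robust to monotone but nonlinear relationships
    # between probe performance and downstream task success.
    rho, p_value = spearmanr(edge_probe_score, success_rate)
    print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")

A high positive rho under this kind of analysis would indicate that encoders whose representations better preserve edge (or keypoint) information also tend to achieve higher environment success rates.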
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~David_Fouhey2
Submission Number: 4087