Open-World Object Manipulation using Pre-Trained Vision-Language Models

Austin Stone; Ted Xiao; Yao Lu; Keerthana Gopalakrishnan; Kuang-Huei Lee; Quan Vuong; Paul Wohlhart; Sean Kirmani; Brianna Zitkovich; Fei Xia; Chelsea Finn; Karol Hausman

Published: 30 Aug 2023, Last Modified: 20 Apr 2025

CoRL 2023 Poster

Readers: Everyone

Abstract: For robots to follow instructions from people, they must be able to connect the rich semantic information in human vocabulary, e.g., "can you get me the pink stuffed whale?", to their sensory observations and actions. This poses a notably difficult challenge: while robot learning approaches allow robots to learn many different behaviors from first-hand experience, it is impractical for robots to have first-hand experiences that span all of this semantic information. We would like a robot's policy to be able to perceive and pick up the pink stuffed whale, even if it has never seen any data of interactions with a stuffed whale before. Fortunately, static data on the internet contains vast semantic information, and this information is captured in pre-trained vision-language models. In this paper, we study whether we can interface robot policies with these pre-trained models, with the aim of allowing robots to complete instructions involving object categories that the robot has never seen first-hand. We develop a simple approach, which we call Manipulation of Open-World Objects (MOO), that leverages a pre-trained vision-language model to extract object-identifying information from the language command and image, and conditions the robot policy on the current image, the instruction, and the extracted object information. In a variety of experiments on a real mobile manipulator, we find that MOO generalizes zero-shot to a wide range of novel object categories and environments. In addition, we show how MOO generalizes to other, non-language-based input modalities for specifying the object of interest, such as finger pointing, and how it can be further extended to enable open-world navigation and manipulation. The project's website and evaluation videos can be found at https://robot-moo.github.io/.
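The data flow described in the abstract (extract object-identifying information with a vision-language model, then condition the policy on the image, the instruction, and that information) can be illustrated with a minimal Python sketch. Everything named here is hypothetical: `locate` stands in for a pre-trained vision-language model, `fake_locate` is a stub, and encoding the object as a single-pixel mask at the box center is an assumption of this sketch rather than something the abstract specifies.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]            # rows of RGB pixels, indexed [y][x]
Box = Tuple[int, int, int, int]      # (x_min, y_min, x_max, y_max)
Locator = Callable[[Image, str], Optional[Box]]


@dataclass
class PolicyInput:
    """The three inputs the abstract says the policy is conditioned on."""
    image: Image
    instruction: str
    object_mask: List[List[int]]     # assumed encoding of the object info


def box_center(box: Box) -> Tuple[int, int]:
    """Return the integer (x, y) center of a bounding box."""
    x_min, y_min, x_max, y_max = box
    return (x_min + x_max) // 2, (y_min + y_max) // 2


def build_policy_input(image: Image, instruction: str,
                       object_phrase: str, locate: Locator) -> PolicyInput:
    """Query the (hypothetical) locator and pack the policy inputs."""
    height, width = len(image), len(image[0])
    mask = [[0] * width for _ in range(height)]
    box = locate(image, object_phrase)
    if box is not None:
        cx, cy = box_center(box)
        mask[cy][cx] = 1             # mark where the named object is
    return PolicyInput(image, instruction, mask)


def fake_locate(image: Image, phrase: str) -> Optional[Box]:
    """Stub standing in for a vision-language model; not a real API."""
    return (2, 1, 4, 3) if "whale" in phrase else None


if __name__ == "__main__":
    blank = [[(0, 0, 0)] * 6 for _ in range(4)]
    inputs = build_policy_input(blank, "pick up the pink stuffed whale",
                                "pink stuffed whale", fake_locate)
    for row in inputs.object_mask:
        print(row)
```

Because the policy in this sketch only sees a location cue and not the object's name, swapping `fake_locate` for a different locator (for example, one driven by a pointing gesture) would leave `build_policy_input` unchanged, which mirrors the abstract's claim about non-language input modalities.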

Student First Author: no

Supplementary Material: zip

Instructions: I have read the instructions for authors (https://corl2023.org/instructions-for-authors/)

Video: https://www.youtube.com/watch?v=KyvHTbLRovI

Website: https://robot-moo.github.io/

Publication Agreement: pdf

Poster Spotlight Video: mp4

Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/open-world-object-manipulation-using-pre/code)
