Keywords: Goal-conditioned RL, CLIP, Robot Learning, Visual RL
TL;DR: We study how large-scale pre-trained vision models can be leveraged for zero-shot goal specification for robot manipulation.
Abstract: Task specification is at the core of programming autonomous robots. A low-effort modality for task
specification is critical for engaging non-expert end-users and for the ultimate adoption of personalized
robot agents. A widely studied approach to task specification is through goals, using either
compact state vectors or goal images from the same robot scene. The former is hard for
non-experts to interpret and necessitates detailed state estimation and scene understanding. The latter requires
the generation of a desired goal image, which often requires a human to complete the task first, defeating
the purpose of having autonomous robots. In this work, we explore alternate and more general
forms of goal specification that we expect to be easier for humans to specify and use, such as
images obtained from the internet, hand sketches that provide a visual description of the desired
task, or simple language descriptions. As a preliminary step in this direction, we investigate the capabilities
of large-scale pre-trained models (foundation models) for zero-shot goal specification, and
find promising results on a collection of simulated robot manipulation tasks and real-world datasets.
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/can-foundation-models-perform-zero-shot-task/code)