Video2Demo: Grounding Videos in State-Action Demonstrations

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: applications to robotics, autonomy, planning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: multimodal applications, vision language models, large language models, task planning, open-vocabulary recognition
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: using pre-trained VLMs and LLMs to construct coherent state-action sequences from video demonstrations
Abstract: Vision-language demonstrations provide a natural way for users to teach robots everyday tasks. However, for effective imitation learning, these demonstrations must be perceptually grounded in the robot's states and actions. While prior works train task-specific models to predict states and actions from images, these models often require extensive manual annotation and fail to generalize to complex scenes. In this work, we leverage pre-trained instruction-following Vision-Language Models (VLMs) that have shown impressive zero-shot generalization for detailed caption generation. However, VLM captions, while descriptive, fail to maintain the structure and temporal consistency required to track object states over time. We propose a novel approach, Video2Demo, that uses GPT-4 to interactively query a generative VLM to construct temporally coherent state-action sequences. These sequences are in turn fed into a language model to generate robot task code that faithfully imitates the demonstration. We evaluate on EPIC-Kitchens, a large-scale human activity dataset, and show that Video2Demo outperforms pure VLM-based approaches, producing more accurate robot task code.
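As a rough illustration of the pipeline the abstract describes, the following is a minimal Python sketch of the interactive querying loop: GPT-4 decides what to ask the VLM about each frame, reconciles the answer with the running history, and finally emits task code. Everything concrete here is an assumption: `query_vlm`, `query_llm`, and the prompt wording are hypothetical placeholders, not the paper's actual interface or prompts.

```python
# Hedged sketch of the Video2Demo loop described in the abstract.
# `query_vlm` and `query_llm` are hypothetical wrappers around a
# generative VLM and GPT-4; the real prompting protocol is not
# specified on this page.

from typing import Dict, List


def query_vlm(frame, question: str) -> str:
    """Hypothetical: ask the VLM a question about a single video frame."""
    raise NotImplementedError


def query_llm(prompt: str) -> str:
    """Hypothetical: send a text prompt to GPT-4 and return its reply."""
    raise NotImplementedError


def video2demo(frames: List) -> str:
    """Build a coherent state-action sequence, then emit robot task code."""
    states: List[Dict[str, str]] = []
    for frame in frames:
        # GPT-4 decides what to ask next, conditioned on the history so
        # far, to keep object states temporally consistent across frames.
        question = query_llm(
            f"Given the state history {states}, what should we ask about "
            "the next frame to track object states and the current action?"
        )
        answer = query_vlm(frame, question)
        # GPT-4 reconciles the (possibly noisy) VLM answer with the history.
        state = query_llm(
            f"History: {states}\nVLM answer: {answer}\n"
            "Return the updated object states and the action that occurred."
        )
        states.append({"state_action": state})
    # Finally, translate the demonstration into executable robot task code.
    return query_llm(
        f"Write robot task code that imitates this demonstration:\n{states}"
    )
```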
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8346