Visual Preference Inference: An Image Sequence-Based Preference Reasoning in Tabletop Object Manipulation

Published: 05 Apr 2024, Last Modified: 22 Apr 2024
Venue: VLMNM 2024
License: CC BY 4.0
Keywords: Semantic Scene Understanding; Deep Learning for Visual Perception
TL;DR: We propose Chain-of-Visual-Residuals (CoVR) prompting, a method that chains descriptions of the visual changes between images to reason about preferences from a long-horizon image sequence in tabletop manipulation environments.
Abstract: In this paper, we address Visual Preference Inference (VPI): the problem of inferring a user's underlying preferences from a sequence of raw visual observations in tabletop manipulation environments with a variety of object types. To facilitate visual reasoning in the context of manipulation, we introduce the Chain-of-Visual-Residuals (CoVR) method. CoVR uses a prompting mechanism that describes the difference between each pair of consecutive images (i.e., the visual residuals) and combines these textual descriptions with the image sequence to infer the user's preference. Code and videos are available at: https://joonhyung-lee.github.io/vpi/
Supplementary Material: zip
Submission Number: 27
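
The prompting flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the prompt wording, and the `describe` and `infer` callables (stand-ins for a vision-language model) are all hypothetical and are not taken from the paper or its code release.

```python
from typing import Callable, List, Sequence

# Hypothetical interfaces (assumptions, not from the paper's code):
# describe(prev_image, next_image) -> text describing what changed
# infer(prompt, images) -> text stating the inferred preference
DescribeFn = Callable[[str, str], str]
InferFn = Callable[[str, Sequence[str]], str]


def visual_residuals(images: Sequence[str], describe: DescribeFn) -> List[str]:
    """Describe the difference between each pair of consecutive images."""
    return [describe(prev, nxt) for prev, nxt in zip(images, images[1:])]


def build_prompt(residuals: Sequence[str]) -> str:
    """Chain the residual descriptions into one prompt (wording is illustrative)."""
    lines = [f"Step {i}: {text}" for i, text in enumerate(residuals, start=1)]
    lines.append(
        "Given the images and the changes described above, "
        "which preference is the user following?"
    )
    return "\n".join(lines)


def infer_preference(
    images: Sequence[str], describe: DescribeFn, infer: InferFn
) -> str:
    """Pass the chained residual text together with the image sequence."""
    residuals = visual_residuals(images, describe)
    return infer(build_prompt(residuals), images)


if __name__ == "__main__":
    # Stub callables so the sketch runs without any model.
    frames = ["frame0.png", "frame1.png", "frame2.png"]
    stub_describe = lambda prev, nxt: f"change from {prev} to {nxt}"
    stub_infer = lambda prompt, imgs: f"(model output for {len(imgs)} images)"
    print(build_prompt(visual_residuals(frames, stub_describe)))
    print(infer_preference(frames, stub_describe, stub_infer))
```

Keeping the two model calls as injected callables leaves the choice of vision-language model open; a sequence of N images yields N-1 residual descriptions.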