Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations

Published: 05 Sept 2024, Last Modified: 08 Nov 2024
Venue: CoRL 2024
License: CC BY 4.0
Keywords: Task Success Prediction, Open-Vocabulary Manipulation, Multi-Level Aligned Visual Representation
TL;DR: We introduce a task success prediction model for open-vocabulary manipulation. The model focuses on the differences between multi-level aligned representations of images. It outperformed existing models, including representative multimodal LLMs.
Abstract: In this study, we consider the problem of predicting whether a manipulator has succeeded at an open-vocabulary manipulation task, based on instruction sentences and egocentric images taken before and after manipulation. Conventional approaches, including multimodal large language models (MLLMs), often fail to capture detailed characteristics of objects or subtle changes in their positions. We propose Contrastive $\lambda$-Repformer, which predicts task success for table-top manipulation tasks by aligning images with instruction sentences. Our method integrates three key types of features into a multi-level aligned representation: features that preserve local image information, features aligned with natural language, and features structured through natural language. This allows the model to focus on important changes by comparing the representations of the two images. We evaluate Contrastive $\lambda$-Repformer on a dataset built from the RT-1 dataset, a large-scale standard dataset, and on a physical robot platform. The results show that our approach outperformed existing approaches, including MLLMs. Our best model improved accuracy by 8.66 points over the representative MLLM-based model.
Supplementary Material: zip
Spotlight Video: mp4
Video: https://www.youtube.com/watch?v=Do3Ig3HqLN0
Website: https://5ei74r0.github.io/contrastive-lambda-repformer.page/
Code: https://github.com/keio-smilab24/contrastive-lambda-repformer
Publication Agreement: pdf
Student Paper: yes
Submission Number: 210
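
The idea described in the abstract, combining three kinds of per-image features into one representation and then looking at how that representation changes between the before and after images, can be illustrated with a toy sketch. This is not the authors' implementation (see the Code link above for that). The function names, feature dimensions, random placeholder features, and the linear sigmoid head are all assumptions made purely for illustration, and the real model presumably uses learned encoders and a more expressive prediction head.

```python
import numpy as np

def multi_level_representation(local_feat, aligned_feat, structured_feat):
    """Concatenate three per-image feature vectors into one representation."""
    return np.concatenate([local_feat, aligned_feat, structured_feat])

def representation_difference(rep_before, rep_after):
    """Element-wise change in the representation between the two images."""
    return rep_after - rep_before

def predict_success(diff, instruction_feat, weights, bias=0.0):
    """Toy linear head (an assumption): sigmoid over [difference, instruction]."""
    x = np.concatenate([diff, instruction_feat])
    logit = float(weights @ x + bias)
    return 1.0 / (1.0 + np.exp(-logit))

rng = np.random.default_rng(0)

# Placeholder features standing in for real encoder outputs (hypothetical dims):
# 4-d local, 3-d language-aligned, 2-d language-structured.
before = multi_level_representation(
    rng.normal(size=4), rng.normal(size=3), rng.normal(size=2)
)
after = multi_level_representation(
    rng.normal(size=4), rng.normal(size=3), rng.normal(size=2)
)
instruction = rng.normal(size=5)  # placeholder instruction embedding
weights = rng.normal(size=before.size + instruction.size)  # untrained weights

diff = representation_difference(before, after)
prob = predict_success(diff, instruction, weights)
print(f"predicted success probability: {prob:.3f}")
```

With identical before and after images the difference vector is all zeros, so in this sketch the prediction depends only on the instruction features and bias. That is the sense in which a difference-based representation directs attention to what changed in the scene.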