MotIF: Motion Instruction Fine-tuning

Published: 24 Oct 2024, Last Modified: 06 Nov 2024 · LEAP 2024 Poster · CC BY 4.0
Keywords: Vision-Language Model, Motion Discriminator, Success Detector
TL;DR: Evaluating robot motions involves more than just the start and end states; it's about how the task is performed. We propose motion instruction fine-tuning (MotIF) and MotIF-1K dataset to improve VLMs' ability to understand nuanced robotic motions.
Abstract:

While success in many robotics tasks can be determined by observing only the final state and how it differs from the initial state (e.g., whether an apple has been picked up), many tasks require observing the robot's full motion to correctly determine success. For example, brushing hair requires repeated strokes that correspond to the contours and type of hair. Prior works often use off-the-shelf vision-language models (VLMs) as success detectors; however, when success depends on the full trajectory, VLMs struggle to make correct judgments for two reasons. First, modern VLMs are trained only on single frames, and thus cannot capture changes over a full trajectory. Second, even if we provide state-of-the-art VLMs with an aggregate input of multiple frames, they still fail to correctly detect success due to a lack of robot data. Our key idea is to fine-tune VLMs using abstract representations that capture trajectory-level information, such as the path the robot takes, by overlaying keypoint trajectories on the final image. We propose motion instruction fine-tuning (MotIF), a method that fine-tunes VLMs using these abstract representations to semantically ground the robot's behavior in the environment. To benchmark and fine-tune VLMs for robotic motion understanding, we introduce the MotIF-1K dataset, containing 653 human and 369 robot demonstrations across 13 task categories. MotIF assesses the success of a robot motion given an image observation of the trajectory, the task instruction, and a motion description. Our model significantly outperforms state-of-the-art VLMs, achieving at least 2x higher precision and 56.1% higher recall, and generalizes across unseen motions, tasks, and environments. Finally, we demonstrate practical applications of MotIF in refining and terminating robot planning, and in ranking trajectories by how well they align with task and motion descriptions. Project page: https://motif-1k.github.io/
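
To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of the abstract representation the abstract describes: overlaying a 2D keypoint trajectory on the final image of an episode so that a single frame carries trajectory-level information. The function name, the color scheme, and the choice of tracked point (e.g., the end-effector position per frame) are illustrative assumptions.

```python
import cv2
import numpy as np

def overlay_keypoint_trajectory(final_frame: np.ndarray,
                                keypoints: np.ndarray) -> np.ndarray:
    """Draw a time-ordered keypoint path (an N x 2 array of pixel coordinates,
    e.g., a tracked end-effector position per frame) onto the final image."""
    canvas = final_frame.copy()
    pts = keypoints.astype(np.int32)

    # Connect consecutive keypoints so the shape of the motion is visible
    # in a single frame.
    cv2.polylines(canvas, [pts.reshape(-1, 1, 2)], isClosed=False,
                  color=(0, 0, 255), thickness=2)

    # Encode temporal direction by shifting marker color from start to end,
    # so the overlay distinguishes, e.g., clockwise from counterclockwise paths.
    n = len(pts)
    for i, (x, y) in enumerate(pts):
        shade = int(255 * (i + 1) / n)
        cv2.circle(canvas, (int(x), int(y)), radius=4,
                   color=(0, shade, 255 - shade), thickness=-1)
    return canvas

# Hypothetical usage: the annotated frame, paired with the task instruction
# and a motion description, would form one fine-tuning or query example.
# frame = cv2.imread("episode_final_frame.png")
# traj = np.load("end_effector_pixels.npy")  # shape (N, 2)
# cv2.imwrite("overlay.png", overlay_keypoint_trajectory(frame, traj))
```

Under this framing, a fine-tuned VLM can be queried with the single overlaid image plus the text of the task instruction and motion description, rather than with a costly aggregate of raw frames.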

Submission Number: 12