InstructPart: Affordance-based Part Segmentation from Language Instruction

Published: 12 Dec 2023, Last Modified: 26 Feb 2024, PubLLM 2024, CC BY 4.0
Track Selection: Track 2: Designing, deploying and evaluating LLM-powered services with directly impacted communities
Keywords: Multimodal LLMs, Part Segmentation, Foundation Models for Robotics, Human Robot Interaction
TL;DR: We introduce a comprehensive dataset pairing image observations and household task instructions with part segmentation masks, and evaluate VLMs on their ability to comprehend and ground part-level tasks in everyday scenarios.
Abstract: Recent advancements in Vision-Language Models (VLMs) have led to their increased application in robotic tasks. While VLMs are primarily applied at the object level, reasoning about the distinct affordances of an object's parts, such as a knife's blade for cutting versus its handle for grasping, remains a challenge for current state-of-the-art models. Our investigations reveal that these models often fail to accurately segment parts based on task instructions, a capability crucial for precise robotic interaction. Addressing the lack of real-world datasets for evaluating such fine-grained tasks, we introduce a comprehensive dataset that includes image observations, task descriptions, and precise annotations of object-part interactions, complemented by part segmentation masks. We present an evaluation of common pre-trained VLMs on this benchmark, shedding light on their performance in understanding and executing part-level tasks within everyday contexts.
Supplementary Material: pdf
Submission Number: 4