ManiTaskGen: A Comprehensive Task and Benchmark Generator for Vision-Language Models in Long-Horizon Embodied Planning
Abstract: Long-horizon manipulation task planning (e.g., object rearrangement) using vision-language models (VLMs) is a critical research direction in embodied AI. Although numerous recent works have proposed specific algorithms and models, their evaluations typically rely on manually selected scenes and a limited set of annotated tasks. We contend that such evaluation methods are neither comprehensive nor fair, and that they require significant manual annotation effort.
In this paper, we introduce an automated method for task generation and benchmark construction: given any interactive scene, our approach generates a comprehensive set of plausible long-horizon manipulation tasks and automatically builds a benchmark for evaluating vision-language planning models. Moreover, by applying our method to off-the-shelf interactive scenes in simulators, we thoroughly evaluate and analyze the performance of existing VLMs on these long-horizon planning tasks. We will open-source our code, offering a universal tool for generating tasks and benchmarks to evaluate VLMs on long-horizon embodied planning.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, cross-modal application
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 662