ManiTaskGen: A Comprehensive Task and Benchmark Generator for Vision-Language Models in Long-Horizon Embodied Planning
Abstract: Long-horizon manipulation task planning (e.g., object rearrangement) using vision-language models (VLMs) is a critical research direction in embodied AI. Although numerous recent works have proposed specific algorithms and models, their evaluations typically rely on manually selected scenes and a limited set of annotated tasks. We contend that such evaluation methods are neither comprehensive nor fair, and that they require significant manual annotation effort.
In this paper, we introduce an automated method for task generation and benchmark construction: given any interactive scene, our approach generates a comprehensive set of plausible long-horizon manipulation tasks and automatically builds a benchmark for evaluating vision-language planning models. Moreover, by applying our method to off-the-shelf interactive scenes in simulators, we thoroughly evaluate and analyze the performance of existing VLMs on these long-horizon planning tasks. We will open-source our code, offering a universal tool for generating tasks and benchmarks to evaluate VLMs on long-horizon embodied planning.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, cross-modal application
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 662