LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic Tabletop Rearrangement

ACL ARR 2025 May Submission 7222 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · License: CC BY 4.0
Abstract: The integration of embodied agents with foundation models has led to notable progress in embodied instruction following. Specifically, the advanced reasoning capabilities of large language models (LLMs) and the visual perception skills of vision-language models (VLMs) enable robots to tackle complex, long-horizon tasks without requiring costly annotated demonstrations. However, public benchmarks for evaluating the long-horizon reasoning capabilities of language-conditioned robots across different scenarios are still lacking. To address this gap, this work introduces LoHoRavens, a simulation benchmark designed for tabletop rearrangement tasks. It comprises 40 challenging tasks covering various aspects of long-horizon reasoning, including color, size, spatiality, arithmetic, reference, shape construction, commonsense, and occlusion. We evaluate two prevalent methods with current advanced VLMs (such as GPT-4o and Gemini 2.0 Flash) on this benchmark and conduct a thorough analysis of their reasoning performance. Our findings indicate that both methods struggle with many of the tasks, shedding light on the most challenging contexts the community should focus on and underscoring the need for continued effort to bridge gaps between modalities and improve current models.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, embodied agents
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 7222