Abstract: The integration of embodied agents with foundation models has led to notable
progress in embodied instruction following. Specifically, the advanced
reasoning capabilities of large language models (LLMs) and the visual
perception skills of vision-language models (VLMs) enable robots to tackle
complex, long-horizon tasks without requiring costly annotated demonstrations.
However, there is still a lack of public benchmarks for evaluating the
long-horizon reasoning capabilities of language-conditioned robots across
different scenarios.
To address this gap, this work introduces \textit{LoHoRavens}, a simulation
benchmark designed for tabletop rearrangement tasks. It includes 40 challenging
tasks and addresses various aspects of long-horizon reasoning such as color,
size, spatiality, arithmetic, reference, shape construction, commonsense, and
occlusion. We evaluate two prevalent methods with current advanced VLMs
(such as GPT-4o and Gemini 2.0 Flash) on this benchmark and conduct a
thorough analysis of their reasoning performance.
Our findings indicate that both methods struggle with numerous tasks, shedding
light on the most challenging contexts the community should focus on and
underscoring the need for continued effort to bridge the gap between
modalities and improve current models.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, embodied agents
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 7222