Human-centered In-building Embodied Delivery Benchmark

ACL ARR 2025 May Submission2385 Authors

19 May 2025 (modified: 03 Jul 2025) · CC BY 4.0
Abstract: The concept of embodied intelligence has recently gained wide acceptance and popularity, prompting natural interest in its commercial potential. In this work, we propose a simulation of a specific real-world scenario: human-centered in-building embodied delivery. For this scenario, we developed a new virtual environment system from scratch, constructing a multi-level connected building space modeled after a polar research station. The environment includes autonomous human characters, robots with grasping and mobility capabilities, and a large number of interactive items. Based on this environment, we built a delivery dataset containing 13k language instructions that guide robots in providing services. We simulate human behavior through the human characters and sample their various daily-life needs. Finally, we propose a baseline method for this dataset built around a large multimodal model. Compared with past embodied-data work, ours focuses on an immersive virtual environment centered on human-robot interaction in an industrial-grade scenario. We believe this will bring new perspectives and directions of exploration to the embodied AI community. Our code, dataset, and benchmark are publicly available.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Vision language navigation; Embodied AI; cross-modal application; benchmarking
Contribution Types: Data resources
Languages Studied: English
Keywords: Vision language navigation; Embodied AI; cross-modal application; benchmarking
Submission Number: 2385