Keywords: Embodied AI
TL;DR: A VLM-based embodied agent benchmark featuring a native action setting and decoupled task evaluation.
Abstract: Recent advances in vision–language models (VLMs) have shed light on human-level embodied intelligence. However, existing benchmarks for VLM-driven embodied agents still rely on pre-defined high-level commands or discretised action spaces—``non-native'' settings that diverge markedly from the real world. Moreover, current benchmarks focus exclusively on high-level tasks, lacking joint evaluation and analysis at both the low and high levels. To bridge these gaps, we present \textbf{NativeEmbodied}, a challenging benchmark for VLM-driven embodied agents that adopts a unified, native low-level action space. Built upon diverse simulated scenes, NativeEmbodied first designs three representative high-level tasks in complex scenarios to evaluate overall performance. For a more detailed and comprehensive analysis, we further decouple the entangled skills behind these complex tasks and construct four types of low-level tasks, each corresponding to a key fundamental embodied skill.
This joint evaluation across task and skill granularities enables a fine-grained assessment of embodied agents. Comprehensive experiments on leading VLMs reveal pronounced deficiencies in certain fundamental embodied skills. Further analysis shows that these low-level bottlenecks severely constrain performance on high-level tasks. NativeEmbodied not only pinpoints the key challenges faced by current VLM-driven embodied agents, but also provides valuable insights for future development.
Submission Number: 56