Keywords: Embodied AI, Embodied Reasoning, Spatial Reasoning, Multimodal Large Language Models, 3D Large Language Models
Abstract: Recent advances in multimodal large language models (MLLMs) have opened new opportunities for embodied intelligence, enabling multimodal understanding, reasoning, and interaction, as well as continuous spatial decision-making. Nevertheless, current MLLM-based embodied systems face two critical limitations. First, the **Geometric Adaptability Gap**: models trained solely on 2D inputs or with hard-coded 3D geometry injection suffer from either insufficient spatial information or restricted 2D generalization, leading to poor adaptability across tasks with diverse spatial demands. Second, the **Embodiment Constraint Gap**: prior work often neglects the physical constraints of real robots, yielding task plans that are theoretically valid but practically infeasible. To address these gaps, we introduce **OmniEVA** -- an embodied versatile planner that enables advanced embodied reasoning and task planning through two pivotal innovations: (1) a **Task-Adaptive 3D Grounding** mechanism, which uses a gated router to dynamically inject 3D features based on task context, enabling selective geometric reasoning; and (2) an **Embodiment-Aware Reasoning** framework, which incorporates task goals and physical constraints into the reasoning loop to ensure executable plans. Extensive experiments show that OmniEVA achieves state-of-the-art performance on 7 of 8 embodied reasoning benchmarks and excels in downstream tasks such as object navigation and mobile manipulation. Evaluations on the proposed primitive and composite benchmarks confirm its robust and versatile planning capabilities.
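To make the gated-routing idea concrete, below is a minimal PyTorch sketch of task-conditioned 3D feature injection. It is an illustrative assumption, not the paper's implementation: the module name `TaskAdaptive3DGrounding`, the feature dimensions, the two-layer router, and the sigmoid-gated additive fusion are all hypothetical choices that merely instantiate the described behavior (a router scores the task context and scales how much projected 3D geometry is added to the 2D token stream).

```python
# Minimal sketch of a gated router for task-adaptive 3D feature injection.
# Assumptions (not from the paper): feature dim, router depth, additive fusion.
import torch
import torch.nn as nn


class TaskAdaptive3DGrounding(nn.Module):
    """Illustrative gate deciding, per task context, how strongly to
    inject 3D geometric features into the 2D visual token stream."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        # Router maps a task-context embedding to a scalar gating score.
        self.router = nn.Sequential(
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, 1),
        )
        # Projects 3D features into the same space as the 2D tokens.
        self.proj_3d = nn.Linear(dim, dim)

    def forward(
        self,
        feats_2d: torch.Tensor,   # (B, N, dim) 2D visual tokens
        feats_3d: torch.Tensor,   # (B, N, dim) aligned 3D geometric features
        task_ctx: torch.Tensor,   # (B, dim) task-context embedding
    ) -> torch.Tensor:
        # Gate in [0, 1]: ~0 keeps pure 2D reasoning, ~1 injects full geometry.
        gate = torch.sigmoid(self.router(task_ctx))          # (B, 1)
        return feats_2d + gate.unsqueeze(1) * self.proj_3d(feats_3d)


# Usage: a spatially demanding task would ideally learn a gate near 1,
# while a purely semantic task would suppress the 3D branch.
fuse = TaskAdaptive3DGrounding(dim=1024)
out = fuse(torch.randn(2, 64, 1024), torch.randn(2, 64, 1024), torch.randn(2, 1024))
```

The soft sigmoid gate is one plausible reading of "dynamically inject 3D features"; a hard or top-k routing variant would fit the same description.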
Primary Area: applications to robotics, autonomy, planning
Submission Number: 4225