LEAP: Logical Embodied Action Planning for Long-Horizon Robotic Tasks via Generative Vision-Language Alignment
Keywords: embodied AI, vision-language-action models, long-horizon planning, implicit reasoning, robotic manipulation, full-parameter fine-tuning
Abstract: Vision-Language-Action models (VLAs) have emerged as a promising paradigm for generalizable sensorimotor control by leveraging pretrained vision-language models. However, despite their efficacy in learning direct input-output mappings, current VLAs struggle with long-horizon tasks that demand an understanding of physical constraints and logical reasoning. In this paper, we introduce \textsc{Leap} (Logical Embodied Action Planning), a framework that enables a compact 2B VLM to master complex, multi-step planning tasks via full-parameter supervised fine-tuning. \textsc{Leap} learns to generate coherent action blueprints directly from single observations, effectively bridging the gap between high-level reasoning and low-level execution. Experimental results on VLABench show that \textsc{Leap} attains superior performance, particularly on the Physics Law dimension, where it outperforms substantially larger baselines (e.g., 3B and 8B models): \textsc{Leap} scores 30.3 on Physics Law, surpassing Qwen2.5-VL (17.4) by a wide margin.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, cross-modal application
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 9461