LEAP: Logical Embodied Action Planning for Long-Horizon Robotic Tasks via Generative Vision-Language Alignment
Keywords: embodied AI, vision-language-action models, long-horizon planning, implicit reasoning, robotic manipulation, full-parameter fine-tuning
Abstract: Vision-Language-Action models (VLAs) have emerged as a promising paradigm for generalizable sensorimotor control by leveraging pretrained vision-language models. However, despite their efficacy in learning direct input-output mappings, current VLAs struggle with long-horizon tasks that demand an understanding of physical constraints and logical reasoning. In this paper, we introduce \textsc{Leap} (Logical Embodied Action Planning), a framework that enables a compact 2B VLM to master complex, multi-step planning tasks via full-parameter supervised fine-tuning. \textsc{Leap} learns to generate coherent action blueprints directly from single observations, effectively bridging the gap between high-level reasoning and low-level execution. Experimental results on VLABench show that \textsc{Leap} attains superior performance, particularly on the Physics Law dimension, where it outperforms substantially larger baselines (e.g., 3B and 8B models): \textsc{Leap} scores 30.3 on Physics Law, surpassing Qwen2.5-VL (17.4) by a wide margin.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, cross-modal application
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 9461