Keywords: Embodied intelligence, multimodal large language models, reinforcement learning, long-sequence planning
TL;DR: A unified multimodal planning framework for complex long-horizon tasks with dynamic learning and reinforced alignment
Abstract: In complex embodied long-horizon manipulation tasks, effective task decomposition and execution require the synergistic integration of textual logical reasoning and visual-spatial imagination to ensure efficient and accurate operation. Current methods lack a unified generation framework for multimodal planning, leading to inconsistencies between textual and visual plans. To address this challenge, we present EVLP (Embodied Vision-Language Planner), an innovative multimodal unified generation framework that jointly models linguistic reasoning and visual generation. Our approach achieves multimodal planning for long-horizon tasks through a novel training pipeline incorporating dynamic pretraining and reinforced alignment. Our core innovations consist of three key components: 1. Unified Multimodal Generation Framework: For understanding, we integrate semantic information with spatial features to provide comprehensive visual perception. For generation, we directly learn the joint distribution of discrete images for one-step visual synthesis, enabling coordinated language-visual modeling through learnable cross-modal attention mechanisms. 2. Dynamic Perception Pretraining: We propose a bidirectional dynamic alignment strategy employing inverse dynamics and forward dynamics tasks, effectively strengthening multimodal correlations within a unified feature space. 3. Reinforced Supervised Fine-Tuning: While conducting instruction-based fine-tuning in the unified generation space, we construct a reinforcement loss to align the spatial logic between textual actions and generated images, enabling the model to acquire spatially aware multimodal planning capabilities. Comprehensive evaluations on multiple complex tasks demonstrate that EVLP significantly outperforms competitive baselines in both instruction execution accuracy and task success rate, benefiting from its unified multimodal architecture and well-designed training pipeline.
Extensive ablation studies further validate the soundness of our framework design.
Primary Area: applications to robotics, autonomy, planning
Submission Number: 12093