ImageDriver: Let Vision-Language-Action Models Drive on 2D Images

19 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Autonomous Driving; End-to-End; Vision-Language Models; Reinforcement Learning
TL;DR: We propose ImageDriver, a new end-to-end autonomous driving VLA, along with a carefully curated dataset and training procedure.
Abstract: Vision-language-action models (VLAs) in autonomous driving, which focus on 3D scene understanding and motion planning, confront a fundamental modality gap: pretrained only on image-text corpora, they inherently lack native 3D spatial comprehension. This limitation either yields coarse-grained textual interpretations of the driving scene or necessitates the integration of computationally expensive auxiliary 3D modules. In this work, we challenge this prevailing convention by introducing ImageDriver, a novel VLA framework that circumvents the dependency on 3D data. It recasts scene understanding and planning as 2D object detection and 2D trajectory prediction tasks, executed directly on the image plane. By leveraging the intrinsic multimodal grounding of Vision-Language Models (VLMs), our method implements a four-step pipeline: egocentrically consistent perception, geometrically grounded reasoning, high-level meta-action prediction, and trajectory prediction, all in a fully differentiable and low-latency manner. We propose a two-stage knowledge-seeded policy optimization paradigm: ImageDriver is first fine-tuned on a multi-task mixed dataset to acquire driving knowledge; to holistically optimize the agent's reasoning and decision-making, we then apply the Group Relative Policy Optimization (GRPO) algorithm to enforce end-to-end policy coherence across the complete VLA pipeline, from perception to planning. Our method achieves state-of-the-art or competitive performance across detection, meta-action prediction, and trajectory prediction tasks, demonstrating both its effectiveness and versatility.
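To make the "planning on the image plane" idea concrete, here is a minimal sketch of how a VLM's text output could encode a 2D trajectory as pixel-coordinate waypoints and be parsed back into a structured form. The output format, function name, and coordinate values are illustrative assumptions, not ImageDriver's actual interface.

```python
import re
from typing import List, Tuple

def parse_waypoints(vlm_output: str) -> List[Tuple[int, int]]:
    """Extract (x, y) pixel waypoints from a VLM text response.

    Assumes a hypothetical output format like:
        'Trajectory: (512, 620) (530, 598) (551, 577)'
    """
    return [(int(x), int(y))
            for x, y in re.findall(r"\((\d+),\s*(\d+)\)", vlm_output)]

# Example usage with a made-up response.
print(parse_waypoints("Trajectory: (512, 620) (530, 598) (551, 577)"))
# [(512, 620), (530, 598), (551, 577)]
```

For readers unfamiliar with GRPO, the second training stage relies on its group-relative advantage: each sampled rollout is scored against the mean and standard deviation of the rewards within its own group, which removes the need for a learned value critic. The sketch below shows only this generic computation; the composite reward over perception, meta-action, and trajectory quality is a hypothetical illustration, not the paper's reward design.

```python
import numpy as np

def grpo_advantages(rewards: list[float]) -> np.ndarray:
    """Group-relative advantages: normalize each rollout's reward
    by the statistics of its own sampled group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: a group of G = 4 sampled VLA rollouts for one driving
# scene, each scored by a (hypothetical) composite reward.
print(grpo_advantages([0.8, 0.2, 0.5, 0.9]))
```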
Primary Area: applications to robotics, autonomy, planning
Submission Number: 15704