Keywords: End-to-End Autonomous Driving, Vision-Language-Action Model, Generative Planner
Abstract: Recently, Vision-Language Models (VLMs) have shown great promise in autonomous driving tasks by leveraging rich world knowledge. However, current methods still face significant challenges in aligning the semantic space with the action space, and they struggle to maintain robust performance in closed-loop evaluations and long-tail scenarios. To address these challenges, we propose BridgEAD, a novel Vision-Language-Action (VLA) framework for end-to-end autonomous driving that unifies action planning and semantic reasoning. It integrates multi-view visual inputs and historical context into an unmodified VLM backbone for driving-scenario reasoning, and leverages a diffusion-based generative planner to further align multimodal scene representations with precise trajectories. We train the model with supervised fine-tuning to enable end-to-end optimization, thereby endowing BridgEAD with both visual question-answering and trajectory planning capabilities. Extensive experiments on multiple benchmarks, including nuScenes, NAVSIM, and Bench2Drive, demonstrate that BridgEAD achieves superior trajectory planning performance in both open-loop and closed-loop evaluations across challenging driving environments. Qualitative results further highlight BridgEAD’s strong semantic reasoning ability in driving-related question-answering tasks. We will make our code publicly available upon publication to support future research in this domain.
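To make the abstract's architecture concrete, here is a minimal, hypothetical sketch of how a VLM backbone could condition a diffusion-based trajectory planner. Every name (`vlm_encode`, `DiffusionPlanner`, `training_step`), all dimensions, and the simple cosine noise schedule are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class DiffusionPlanner(nn.Module):
    """Predicts the noise added to a waypoint trajectory, conditioned on a scene token."""
    def __init__(self, horizon=8, token_dim=768, hidden=256):
        super().__init__()
        self.horizon = horizon
        self.net = nn.Sequential(
            nn.Linear(horizon * 2 + token_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, horizon * 2),
        )

    def forward(self, noisy_traj, scene_token, t):
        # noisy_traj: (B, horizon, 2) xy waypoints; scene_token: (B, token_dim); t: (B, 1) in [0, 1]
        x = torch.cat([noisy_traj.flatten(1), scene_token, t], dim=-1)
        return self.net(x).view(-1, self.horizon, 2)

def training_step(vlm_encode, planner, images, prompt, gt_traj, num_steps=1000):
    """One denoising-objective step: noise the ground-truth trajectory and
    regress the noise back, conditioned on the VLM's scene representation."""
    scene_token = vlm_encode(images, prompt)  # assumed pooling of the VLM backbone's hidden states, (B, token_dim)
    t = torch.randint(1, num_steps, (gt_traj.size(0), 1), device=gt_traj.device)
    alpha_bar = torch.cos(0.5 * torch.pi * t / num_steps) ** 2  # illustrative cosine schedule
    noise = torch.randn_like(gt_traj)
    noisy = alpha_bar.sqrt().unsqueeze(-1) * gt_traj + (1 - alpha_bar).sqrt().unsqueeze(-1) * noise
    pred_noise = planner(noisy, scene_token, t.float() / num_steps)
    return nn.functional.mse_loss(pred_noise, noise)
```

At inference, such a planner would start from Gaussian noise and iteratively denoise a trajectory conditioned on the scene token; how BridgEAD actually implements the generative planner and the VLM conditioning is specified in the paper, not here.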
Primary Area: applications to robotics, autonomy, planning
Submission Number: 17528