Keywords: VLA, World model, End-to-end autonomous driving
TL;DR: DriveVLA-W0 uses world modeling to overcome the supervision deficit in VLA models, amplifying data scaling laws on large-scale driving datasets.
Abstract: Scaling Vision-Language-Action (VLA) models on large-scale data offers a promising path toward more generalized driving intelligence.
However, VLA models are limited by a "supervision deficit": their vast representational capacity is supervised only by sparse, low-dimensional actions, leaving much of it underutilized.
To remedy this, we propose DriveVLA-W0, a training paradigm that employs world modeling to predict future images.
This task generates a dense, self-supervised signal that compels the model to learn the underlying dynamics of the driving environment.
We showcase the paradigm's versatility by instantiating it for two dominant VLA archetypes: an autoregressive world model for VLAs that use discrete visual tokens, and a diffusion world model for those operating on continuous visual features.
Building on the rich representations learned through world modeling, we introduce a lightweight action expert that reduces inference latency for real-time deployment.
Extensive experiments on the NAVSIM v1/v2 benchmarks and a 680x larger in-house dataset demonstrate that DriveVLA-W0 significantly outperforms BEV and VLA baselines.
Crucially, it amplifies the data scaling law, showing that performance gains accelerate as the training dataset size increases.
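To make the training paradigm concrete, below is a minimal sketch of what a joint action + world-modeling objective could look like. The model interface, loss choices, and the weighting term `lambda_world` are hypothetical illustrations, not details taken from the paper.

```python
import torch.nn.functional as F

def joint_loss(vla_model, batch, lambda_world=1.0):
    """Illustrative joint objective: sparse action supervision plus a
    dense, self-supervised world-modeling (future-image) loss.

    `vla_model` is a stand-in assumed to return both a trajectory
    prediction and a predicted next frame; both heads and the
    `lambda_world` weight are assumptions for this sketch.
    """
    pred_traj, pred_next_frame = vla_model(batch["images"], batch["language"])

    # Sparse, low-dimensional supervision from ground-truth actions.
    action_loss = F.l1_loss(pred_traj, batch["gt_trajectory"])

    # Dense self-supervised signal: predict the future image observation,
    # forcing the model to capture the environment's underlying dynamics.
    world_loss = F.mse_loss(pred_next_frame, batch["next_images"])

    return action_loss + lambda_world * world_loss
```

In the paper's framing, the world-modeling head would be instantiated either autoregressively over discrete visual tokens or as a diffusion model over continuous visual features, depending on the VLA archetype.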
Primary Area: applications to robotics, autonomy, planning
Submission Number: 349