Keywords: End-to-end driving, Generative video modeling, Imitation learning
TL;DR: We show that large-scale generative video pretraining enables state-of-the-art end-to-end autonomous driving from raw video alone, without any manual annotations.
Abstract: We present VaVAM, a Video-Action Model for end-to-end autonomous driving trained entirely on unannotated videos. VaVAM combines large-scale generative video pretraining with lightweight imitation learning to form a self-supervised perception-to-action pipeline. We first train a transformer-based video model on more than 1,800 hours of diverse, publicly available driving footage, learning spatio-temporal representations through autoregressive prediction. We then use these representations to train an action predictor by imitation learning on driving trajectories. The resulting features are rich and driving-relevant, enabling strong generalization in closed-loop evaluations. VaVAM achieves state-of-the-art performance in safety-critical scenarios on the NeuroNCAP benchmark, demonstrating the practical value of generative video pretraining for real-world driving.
Serve As Reviewer: ~Florent_Bartoccioni2
Submission Number: 4