Keywords: End-to-end driving, Generative video modeling, Imitation learning
TL;DR: We show that large-scale generative video pretraining enables state-of-the-art end-to-end autonomous driving from raw video alone, without any manual annotations.
Abstract: We present VaVAM, a Video-Action Model for end-to-end autonomous driving trained entirely on unannotated videos. VaVAM combines large-scale generative video pretraining with lightweight imitation learning to form a self-supervised perception-to-action pipeline. We first train a transformer-based video model on more than 1,800 hours of diverse, publicly available driving footage, learning spatio-temporal representations through autoregressive prediction. We then use these representations to train an action predictor by imitation learning on driving trajectories. The resulting features are rich and driving-relevant, enabling strong generalization in closed-loop evaluations. VaVAM achieves state-of-the-art performance in safety-critical scenarios on the NeuroNCAP benchmark, demonstrating the practical value of generative video pretraining for real-world driving.
Serve As Reviewer: ~Florent_Bartoccioni2
Submission Number: 4