Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos

Published: 09 May 2026, Last Modified: 09 May 2026, Precognition 2026, CC BY 4.0
Keywords: autonomous driving, 3d vision, pretraining, future feature and scene prediction, planning
TL;DR: A label-free framework for learning autonomous driving representations from YouTube videos.
Abstract: Ego-centric driving videos available online provide an abundant source of visual data for autonomous driving, yet their lack of annotations makes it difficult to learn representations that capture both semantic structure and 3D geometry. We propose LFG (Learning to drive is a Free Gift), a label-free, teacher-guided framework that learns geometry-, motion-, and semantics-aware representations directly from unposed, single-view YouTube driving videos. LFG extends a feedforward 3D reconstruction backbone with a lightweight causal autoregressive module and multimodal teacher supervision to jointly predict current and short-horizon future point maps, camera poses, semantic segmentation, confidence maps, and motion masks—forming a unified pseudo-4D representation learned entirely without poses, labels, or LiDAR. On the NAVSIM planning benchmark, LFG achieves state-of-the-art performance using only a single front camera, outperforming multi-camera and LiDAR baselines while exhibiting strong data efficiency: with only 10% labeled data, LFG matches the full-data performance of DINOv3. It further transfers effectively to semantic segmentation, depth estimation, and trajectory prediction, positioning LFG as a compelling video-centric foundation model for autonomous driving.
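
The abstract describes a multi-head architecture: a feedforward 3D reconstruction backbone, a lightweight causal autoregressive module for short-horizon future rollout, and prediction heads for point maps, camera poses, semantic segmentation, confidence maps, and motion masks. Below is a minimal, hypothetical PyTorch sketch of that design. Every name and module choice (the Conv2d stand-in backbone, the GRUCell rollout, the head shapes) is an assumption for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of the multi-head, pseudo-4D prediction design
# outlined in the abstract. Module names and interfaces are assumptions.
import torch
import torch.nn as nn


class LFGSketch(nn.Module):
    def __init__(self, feat_dim: int = 768, num_classes: int = 19, horizon: int = 3):
        super().__init__()
        # Stand-in for the feedforward 3D reconstruction backbone
        # (e.g., a patchifying encoder producing per-patch features).
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)
        # Lightweight causal autoregressive module that rolls per-patch
        # features forward over a short future horizon.
        self.ar = nn.GRUCell(feat_dim, feat_dim)
        self.horizon = horizon
        # One head per pseudo-4D output named in the abstract.
        self.point_map = nn.Conv2d(feat_dim, 3, 1)      # xyz point map
        self.confidence = nn.Conv2d(feat_dim, 1, 1)     # confidence map
        self.seg = nn.Conv2d(feat_dim, num_classes, 1)  # semantic segmentation
        self.motion = nn.Conv2d(feat_dim, 1, 1)         # motion mask
        self.pose = nn.Linear(feat_dim, 7)              # camera pose (quat + trans)

    def decode(self, f: torch.Tensor) -> dict:
        # Per-pixel heads read dense features; the pose head reads a
        # globally pooled feature.
        pooled = f.mean(dim=(2, 3))
        return {
            "points": self.point_map(f),
            "conf": self.confidence(f),
            "seg": self.seg(f),
            "motion": self.motion(f),
            "pose": self.pose(pooled),
        }

    def forward(self, frame: torch.Tensor) -> list:
        f = self.backbone(frame)            # (B, C, H/16, W/16)
        outputs = [self.decode(f)]          # current-frame predictions
        b, c, h, w = f.shape
        state = f.permute(0, 2, 3, 1).reshape(-1, c)  # per-patch states
        for _ in range(self.horizon):       # short-horizon future rollout
            state = self.ar(state, state)
            f_t = state.reshape(b, h, w, c).permute(0, 3, 1, 2)
            outputs.append(self.decode(f_t))
        return outputs
```

In training, each head would presumably be matched against pseudo-labels from the multimodal teachers (e.g., a reconstruction teacher for point maps and poses, a segmentation teacher for masks), which is how supervision could be obtained without poses, labels, or LiDAR.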
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 31