Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos
Keywords: autonomous driving, 3d vision, pretraining, future feature and scene prediction, planning
TL;DR: A label-free framework for learning autonomous driving representations from YouTube videos.
Abstract: Egocentric driving videos available online provide an abundant source of visual data for autonomous driving, yet their lack of annotations makes it difficult to learn representations that capture both semantic structure and 3D geometry. We propose LFG (Learning to Drive is a Free Gift), a label-free, teacher-guided framework that learns geometry-, motion-, and semantics-aware representations directly from unposed, single-view YouTube driving videos. LFG extends a feedforward 3D reconstruction backbone with a lightweight causal autoregressive module and multimodal teacher supervision to jointly predict current and short-horizon future point maps, camera poses, semantic segmentation, confidence maps, and motion masks, forming a unified pseudo-4D representation learned entirely without poses, labels, or LiDAR. On the NAVSIM planning benchmark, LFG achieves state-of-the-art performance using only a single front camera, outperforming multi-camera and LiDAR baselines while exhibiting strong data efficiency: with only 10% of the labeled data, LFG matches the full-data performance of DINOv3. It further transfers effectively to semantic segmentation, depth estimation, and trajectory prediction, positioning LFG as a compelling video-centric foundation model for autonomous driving.
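The abstract names the components (a feedforward 3D reconstruction backbone, a lightweight causal autoregressive module, and heads for point maps, poses, segmentation, confidence, and motion masks) but the submission page carries no code. Below is a minimal PyTorch sketch of how such a backbone-plus-heads layout could fit together; the module choices, dimensions, class name LFGSketch, and the 7-D quaternion-plus-translation pose parameterization are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class LFGSketch(nn.Module):
    """Hypothetical sketch: backbone -> causal AR module -> per-target heads."""

    def __init__(self, dim: int = 768, num_classes: int = 19):
        super().__init__()
        # Stand-in for the feedforward 3D reconstruction backbone; assumed
        # here to emit one feature token per frame/patch (illustrative only).
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Lightweight causal autoregressive module: a causal attention mask
        # restricts each token to past context, so short-horizon future
        # targets are predicted from history alone.
        self.causal_ar = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # One head per teacher-supervised target named in the abstract.
        self.heads = nn.ModuleDict({
            "points": nn.Linear(dim, 3),               # per-token 3D point map
            "pose": nn.Linear(dim, 7),                 # camera pose (quaternion + translation)
            "semantics": nn.Linear(dim, num_classes),  # segmentation logits
            "confidence": nn.Linear(dim, 1),           # confidence map
            "motion": nn.Linear(dim, 1),               # motion mask
        })

    def forward(self, tokens: torch.Tensor) -> dict:
        # tokens: (batch, sequence, dim) video features in temporal order.
        feats = self.backbone(tokens)
        t = feats.size(1)
        # Upper-triangular -inf mask blocks attention to future positions.
        mask = torch.triu(
            torch.full((t, t), float("-inf"), device=feats.device), diagonal=1
        )
        feats = self.causal_ar(feats, mask=mask)
        return {name: head(feats) for name, head in self.heads.items()}


# Example: two clips of 16 tokens each.
out = LFGSketch()(torch.randn(2, 16, 768))
print({k: tuple(v.shape) for k, v in out.items()})
```

In this sketch all heads share the causally-attended features, which matches the abstract's claim of a unified representation trained jointly across targets; how the actual model pools tokens (e.g., for per-frame pose) is not specified in the abstract.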
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 31