World-simulation as pre-training for scalable perception

ICLR 2025 Conference Submission 12772 Authors

28 Sept 2024 (modified: 13 Oct 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: autonomous driving; computer vision; autoregressive transformer; self-supervised learning
Abstract: Image-based autoregressive next-token prediction offers a promising avenue for developing world video simulators for autonomous driving. However, applications of these autoregressive models to common perception tasks such as geometric and semantic understanding remain under-explored, largely due to the difficulty of applying discrete token modeling to perception tasks. In this paper, we introduce PerceptionLM, an end-to-end framework that leverages autoregressive world simulators to effectively improve perception tasks. It consists of a token-based pretraining stage and a novel fine-tuning stage that adapts discrete tokens to continuous embeddings for perception tasks. During pretraining, we leverage world knowledge from Segment Anything and Depth Anything through autoregressive next-token prediction, imbuing the model with signals from multiple vision modalities. During fine-tuning, we propose a novel decoder adaptor that fuses discrete tokens with continuous embeddings from image encoders, overcoming the limitations of discrete tokens. With PerceptionLM, we observe impressive scaling properties: quality consistently improves with more training compute or longer temporal context. On multiple public benchmarks, including nuScenes, nuImages, the Waymo Open Dataset, and the Waymo Open Motion Dataset, PerceptionLM demonstrates significant performance improvements on common perception tasks such as depth estimation and semantic segmentation, highlighting its potential for scaling vision-only foundation models for autonomous driving.
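The abstract's fine-tuning stage hinges on fusing discrete world-model tokens with continuous image-encoder features. The sketch below is a hypothetical illustration of that idea, not the authors' released code: all module names, dimensions, and the concat-and-project fusion are assumptions standing in for the paper's decoder adaptor.

```python
# Hypothetical sketch (not the authors' implementation): fuse discrete tokens from
# a pretrained autoregressive world simulator with continuous image-encoder features,
# producing embeddings that a depth or segmentation head could consume.
import torch
import torch.nn as nn


class DecoderAdapterSketch(nn.Module):
    """Embed discrete world-model tokens and fuse them with continuous per-patch
    features from an image encoder (illustrative fusion via concat + projection)."""

    def __init__(self, vocab_size: int, token_dim: int, encoder_dim: int, fused_dim: int):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, token_dim)     # discrete ids -> continuous vectors
        self.proj = nn.Linear(token_dim + encoder_dim, fused_dim)  # simple concat-and-project fusion
        self.norm = nn.LayerNorm(fused_dim)

    def forward(self, token_ids: torch.Tensor, encoder_feats: torch.Tensor) -> torch.Tensor:
        # token_ids:     (B, N)               discrete codes predicted by the world simulator
        # encoder_feats: (B, N, encoder_dim)  continuous features from an image encoder
        tok = self.token_embed(token_ids)                          # (B, N, token_dim)
        fused = self.proj(torch.cat([tok, encoder_feats], dim=-1))
        return self.norm(fused)                                    # input to a perception head


# Toy usage with made-up shapes
adapter = DecoderAdapterSketch(vocab_size=8192, token_dim=256, encoder_dim=768, fused_dim=512)
ids = torch.randint(0, 8192, (2, 196))
feats = torch.randn(2, 196, 768)
out = adapter(ids, feats)  # (2, 196, 512)
```

The concat-and-project fusion is only one plausible design; the paper's adaptor may instead use cross-attention or another mechanism to combine the two streams.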
Primary Area: applications to robotics, autonomy, planning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12772