Image-based autoregressive next-token prediction offers a promising avenue for developing world video simulators for autonomous driving. However, applications of these autoregressive models to common perception tasks such as geometric and semantic understanding remain under-explored, largely due to the difficulty of applying discrete token modeling to perception tasks. In this paper, we introduce PerceptionLM, an end-to-end framework that leverages autoregressive world simulators to effectively improve perception tasks. It consists of a token-based pretraining stage and a novel fine-tuning stage that adapts discrete tokens to continuous embeddings for perception. During pretraining, we distill knowledge from Segment Anything and Depth Anything through autoregressive next-token prediction, imbuing the model with world knowledge from multiple vision modalities. During fine-tuning, we propose a novel decoder adaptor that fuses discrete tokens with continuous embeddings from image encoders, overcoming the limitations of discrete tokens. PerceptionLM exhibits impressive scaling properties: quality consistently improves with more training compute or longer temporal context. On multiple public benchmarks, including nuScenes, nuImages, the Waymo Open Dataset, and the Waymo Open Motion Dataset, PerceptionLM demonstrates significant performance improvements on common perception tasks such as depth estimation and semantic segmentation, highlighting its potential for scaling vision-only foundation models for autonomous driving.
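To make the decoder-adaptor idea concrete, below is a minimal sketch of how discrete world-model tokens might be fused with continuous image-encoder features before a dense prediction head. The abstract does not specify the architecture, so every name and dimension here (`vocab_size`, `token_dim`, `encoder_dim`, the cross-attention fusion, the single-channel regression head) is a hypothetical illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DecoderAdaptorSketch(nn.Module):
    """Hypothetical sketch of a decoder adaptor: fuse discrete tokens from a
    pretrained autoregressive world model with continuous embeddings from an
    image encoder, then regress a dense perception target (e.g., depth)."""

    def __init__(self, vocab_size=8192, token_dim=512, encoder_dim=768, fused_dim=512):
        super().__init__()
        # Embed the discrete codes emitted by the world model.
        self.token_embed = nn.Embedding(vocab_size, token_dim)
        # Project both streams into a shared width for fusion.
        self.token_proj = nn.Linear(token_dim, fused_dim)
        self.encoder_proj = nn.Linear(encoder_dim, fused_dim)
        # Cross-attention: the token stream queries the continuous features,
        # recovering fine-grained detail that quantization discards.
        self.cross_attn = nn.MultiheadAttention(fused_dim, num_heads=8, batch_first=True)
        # Per-token regression head (one continuous value per patch).
        self.head = nn.Linear(fused_dim, 1)

    def forward(self, token_ids, encoder_feats):
        # token_ids: (B, N) discrete codes; encoder_feats: (B, N, encoder_dim)
        q = self.token_proj(self.token_embed(token_ids))
        kv = self.encoder_proj(encoder_feats)
        fused, _ = self.cross_attn(q, kv, kv)
        return self.head(fused)  # (B, N, 1) continuous predictions

# Usage with dummy shapes (assumed, for illustration only):
adaptor = DecoderAdaptorSketch()
tokens = torch.randint(0, 8192, (2, 256))     # 16x16 grid of discrete codes
feats = torch.randn(2, 256, 768)              # matching continuous features
depth = adaptor(tokens, feats)                # (2, 256, 1)
```

Cross-attention is just one plausible fusion choice; concatenation followed by an MLP, or additive fusion of the projected streams, would fit the same description in the abstract.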