Image-based autoregressive next-token prediction offers a promising avenue for developing world video simulators for autonomous driving. However, applications of these autoregressive models to common perception tasks such as geometric and semantic understanding remain under-explored, largely due to the difficulty of applying discrete token modeling to perception tasks. In this paper, we introduce PerceptionLM, an end-to-end framework that leverages autoregressive world simulators to effectively improve perception tasks. It consists of a token-based pretraining stage and a novel fine-tuning stage that adapts discrete tokens to continuous embeddings for perception. During pretraining, we distill knowledge from Segment Anything and Depth Anything through autoregressive next-token prediction, imbuing the model with world knowledge from multiple vision modalities. During fine-tuning, we propose a novel decoder adaptor that fuses discrete tokens with continuous embeddings from image encoders, overcoming the limitations of discrete tokens. PerceptionLM exhibits impressive scaling properties: quality consistently improves with more training compute or longer temporal context. On multiple public benchmarks, including nuScenes, nuImages, the Waymo Open Dataset, and the Waymo Open Motion Dataset, PerceptionLM demonstrates significant performance improvements on common perception tasks such as depth estimation and semantic segmentation, highlighting its potential for scaling vision-only foundation models for autonomous driving.
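To make the decoder-adaptor idea concrete, below is a minimal sketch of how discrete world-model tokens might be fused with continuous image-encoder features before a dense prediction head. The abstract does not specify the architecture, so every name and dimension here (`vocab_size`, `token_dim`, `encoder_dim`, the cross-attention fusion, the single-channel regression head) is a hypothetical illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DecoderAdaptorSketch(nn.Module):
    """Hypothetical sketch of a decoder adaptor: fuse discrete tokens from a
    pretrained autoregressive world model with continuous embeddings from an
    image encoder, then regress a dense perception target (e.g., depth)."""

    def __init__(self, vocab_size=8192, token_dim=512, encoder_dim=768, fused_dim=512):
        super().__init__()
        # Embed the discrete codes emitted by the world model.
        self.token_embed = nn.Embedding(vocab_size, token_dim)
        # Project both streams into a shared width for fusion.
        self.token_proj = nn.Linear(token_dim, fused_dim)
        self.encoder_proj = nn.Linear(encoder_dim, fused_dim)
        # Cross-attention: the token stream queries the continuous features,
        # recovering fine-grained detail that quantization discards.
        self.cross_attn = nn.MultiheadAttention(fused_dim, num_heads=8, batch_first=True)
        # Per-token regression head (one continuous value per patch).
        self.head = nn.Linear(fused_dim, 1)

    def forward(self, token_ids, encoder_feats):
        # token_ids: (B, N) discrete codes; encoder_feats: (B, N, encoder_dim)
        q = self.token_proj(self.token_embed(token_ids))
        kv = self.encoder_proj(encoder_feats)
        fused, _ = self.cross_attn(q, kv, kv)
        return self.head(fused)  # (B, N, 1) continuous predictions

# Usage with dummy shapes (assumed, for illustration only):
adaptor = DecoderAdaptorSketch()
tokens = torch.randint(0, 8192, (2, 256))     # 16x16 grid of discrete codes
feats = torch.randn(2, 256, 768)              # matching continuous features
depth = adaptor(tokens, feats)                # (2, 256, 1)
```

Cross-attention is just one plausible fusion choice; concatenation followed by an MLP, or additive fusion of the projected streams, would fit the same description in the abstract.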