Keywords: Temporal Consistency, Video Generation
TL;DR: We present an identity-preserving world model that generates realistic multi-view driving videos with superior fine-grained temporal consistency.
Abstract: Autonomous driving relies on robust models trained on large-scale, high-quality multi-view driving videos.
Although world models provide a cost-effective solution for generating realistic driving data, they often suffer from identity drift, where the same object changes its appearance or category across frames due to the absence of instance-level temporal constraints.
We introduce **ConsisDrive**, an identity-preserving driving world model designed to enforce temporal consistency at the instance level.
Our framework incorporates two key components (sketched below): (1) Instance-Masked Attention, which applies instance identity masks and trajectory masks within attention blocks so that visual tokens interact only with their corresponding instance features across spatial and temporal dimensions, thereby preserving object identity consistency; and (2) Instance-Masked Loss, which adaptively emphasizes foreground regions with probabilistic instance masking, reducing background noise while maintaining overall scene fidelity.
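To make the two mechanisms concrete, here is a minimal PyTorch sketch of how an instance-masked attention bias and a probabilistically weighted instance loss could look. The function names, tensor shapes, the background-token handling, and the parameters `fg_weight` and `drop_prob` are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def instance_masked_attention(q, k, v, inst_ids):
    """Scaled dot-product attention restricted by instance identity.

    q, k, v: (B, N, D) flattened spatio-temporal visual tokens.
    inst_ids: (B, N) integer instance ID per token (0 = background),
              e.g. rasterized from instance/trajectory masks.
    Instance tokens attend only to tokens of the same instance;
    background tokens attend to all tokens (an assumption here, so
    scene context is preserved and no softmax row is fully masked).
    """
    same_inst = inst_ids.unsqueeze(2) == inst_ids.unsqueeze(1)   # (B, N, N)
    is_bg_query = (inst_ids == 0).unsqueeze(2)                   # (B, N, 1)
    allowed = same_inst | is_bg_query                            # bool mask
    # Additive bias: 0 where attention is allowed, -inf where it is not.
    bias = torch.zeros_like(allowed, dtype=q.dtype)
    bias.masked_fill_(~allowed, float("-inf"))
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5 + bias
    return torch.softmax(scores, dim=-1) @ v

def instance_masked_loss(pred, target, fg_mask, fg_weight=2.0, drop_prob=0.5):
    """Denoising loss that probabilistically up-weights instance pixels.

    pred, target: (B, C, H, W) predicted and target (noise) tensors.
    fg_mask: (B, 1, H, W) binary foreground (instance) mask.
    With probability drop_prob the mask is ignored so the model still
    learns the full scene; otherwise foreground pixels get fg_weight.
    """
    weights = torch.ones_like(pred)
    if torch.rand(()) > drop_prob:
        weights = weights + (fg_weight - 1.0) * fg_mask
    return (weights * (pred - target) ** 2).mean()
```

Expressing the instance constraint as an additive bias keeps the computation a single standard softmax attention call, and the probabilistic weighting lets the loss emphasize instances without degrading background fidelity; the exact masking and weighting schedule used by ConsisDrive may differ.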
By integrating these mechanisms, ConsisDrive achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset.
Supplementary Material: zip
Primary Area: generative models
Submission Number: 18458