Keywords: Temporal Consistency, Video Generation
TL;DR: We present an identity-preserving world model that generates realistic multi-view driving videos with superior fine-grained temporal consistency.
Abstract: Autonomous driving relies on robust models trained on large-scale, high-quality multi-view driving videos.
Although world models provide a cost-effective solution for generating realistic driving data, they often suffer from identity drift, where the same object changes its appearance or category across frames due to the absence of instance-level temporal constraints.
We introduce **ConsisDrive**, an identity-preserving driving world model designed to enforce temporal consistency at the instance level.
Our framework incorporates two key components (sketched below): (1) Instance-Masked Attention, which applies instance identity masks and trajectory masks within attention blocks so that visual tokens interact only with their corresponding instance features across spatial and temporal dimensions, thereby preserving object identity consistency; and (2) Instance-Masked Loss, which adaptively emphasizes foreground regions with probabilistic instance masking, reducing background noise while maintaining overall scene fidelity.
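To make the two mechanisms concrete, here is a minimal PyTorch sketch of how an instance-masked attention bias and a probabilistically weighted instance loss could look. The function names, tensor shapes, the background-token handling, and the parameters `fg_weight` and `drop_prob` are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def instance_masked_attention(q, k, v, inst_ids):
    """Scaled dot-product attention restricted by instance identity.

    q, k, v: (B, N, D) flattened spatio-temporal visual tokens.
    inst_ids: (B, N) integer instance ID per token (0 = background),
              e.g. rasterized from instance/trajectory masks.
    Instance tokens attend only to tokens of the same instance;
    background tokens attend to all tokens (an assumption here, so
    scene context is preserved and no softmax row is fully masked).
    """
    same_inst = inst_ids.unsqueeze(2) == inst_ids.unsqueeze(1)   # (B, N, N)
    is_bg_query = (inst_ids == 0).unsqueeze(2)                   # (B, N, 1)
    allowed = same_inst | is_bg_query                            # bool mask
    # Additive bias: 0 where attention is allowed, -inf where it is not.
    bias = torch.zeros_like(allowed, dtype=q.dtype)
    bias.masked_fill_(~allowed, float("-inf"))
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5 + bias
    return torch.softmax(scores, dim=-1) @ v

def instance_masked_loss(pred, target, fg_mask, fg_weight=2.0, drop_prob=0.5):
    """Denoising loss that probabilistically up-weights instance pixels.

    pred, target: (B, C, H, W) predicted and target (noise) tensors.
    fg_mask: (B, 1, H, W) binary foreground (instance) mask.
    With probability drop_prob the mask is ignored so the model still
    learns the full scene; otherwise foreground pixels get fg_weight.
    """
    weights = torch.ones_like(pred)
    if torch.rand(()) > drop_prob:
        weights = weights + (fg_weight - 1.0) * fg_mask
    return (weights * (pred - target) ** 2).mean()
```

Expressing the instance constraint as an additive bias keeps the computation a single standard softmax attention call, and the probabilistic weighting lets the loss emphasize instances without degrading background fidelity; the exact masking and weighting schedule used by ConsisDrive may differ.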
By integrating these mechanisms, ConsisDrive achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset.
Supplementary Material: zip
Primary Area: generative models
Submission Number: 18458