ConsisDrive: Identity-Preserving Driving World Models for Video Generation by Instance Mask

Abstract

Autonomous driving relies on robust models trained on large-scale, high-quality multi-view driving videos. Although world models provide a cost-effective solution for generating realistic driving data, they often suffer from identity drift, where the same object changes its appearance or category across frames due to the absence of instance-level temporal constraints. We introduce ConsisDrive, an identity-preserving driving world model designed to enforce temporal consistency at the instance level. Our framework incorporates two key components: (1) Instance-Masked Attention, which applies instance identity masks and trajectory masks within attention blocks to ensure that visual tokens interact only with their corresponding instance features across spatial and temporal dimensions, thereby preserving object identity consistency; and (2) Instance-Masked Loss, which adaptively emphasizes foreground regions with probabilistic instance masking, reducing background noise while maintaining overall scene fidelity. By integrating these mechanisms, ConsisDrive achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset.
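The Instance-Masked Attention described above can be sketched as a standard attention step whose score matrix is masked so that each visual token attends only to tokens carrying the same instance identity across space and time. The sketch below is a minimal illustration under our own assumptions, not the paper's implementation: `instance_masked_attention` and its mask-building logic are hypothetical, instance IDs are assumed to be integer labels with 0 denoting background, and background tokens are assumed to attend freely.

```python
import torch

def instance_masked_attention(q, k, v, inst_ids_q, inst_ids_k):
    """Hypothetical sketch of attention restricted by instance identity.

    q, k, v:                (B, Nq, D) / (B, Nk, D) / (B, Nk, D)
    inst_ids_q, inst_ids_k: (B, Nq) / (B, Nk) integer instance labels,
                            where 0 marks background tokens.
    """
    d = q.shape[-1]
    # Scaled dot-product attention scores: (B, Nq, Nk).
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5

    # A query-key pair is allowed if both tokens share an instance ID;
    # background queries (ID 0) are assumed to attend everywhere.
    same_instance = inst_ids_q.unsqueeze(-1) == inst_ids_k.unsqueeze(-2)
    background_q = (inst_ids_q == 0).unsqueeze(-1)
    allowed = same_instance | background_q

    # Disallowed pairs get -inf before the softmax, so they receive zero weight.
    scores = scores.masked_fill(~allowed, float("-inf"))
    attn = torch.softmax(scores, dim=-1)
    return torch.matmul(attn, v)
```

In self-attention every query matches its own key's instance ID, so each row of the score matrix keeps at least one finite entry and the softmax stays well defined; a token belonging to a unique instance reduces to copying its own value vector.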

ConsisDrive Overview

1. Instruction Following Ability

1.1 Text Prompt Editing

ConsisDrive can generate diverse weather scenarios from the same control conditions.
This makes it possible to simulate extreme weather for training perception models.
We append descriptors such as "sunny," "rainy," or "night" to the text prompt to edit the generated video.
The videos below show the control condition, followed by the "sunny," "rainy," and "night" scenarios.

2. Stochastic Diversity of Generation

ConsisDrive can generate diverse videos from different stochastic noise inputs under the same control conditions.
The videos below show the control condition in the first row, followed by two videos sampled with different noise inputs.
Both sampled videos in the second and third rows adhere to the constraints defined in the first row.

3. Generalization to Private Dataset

In addition to public datasets, we trained on a 200-hour private dataset. The results below demonstrate generation quality and control performance comparable to those on nuScenes, highlighting the generalization capability of our method.

4. Instance Identity Preservation

4.1 Comparison with Baseline

4.2 Instance Attribute Binding and Propagation

4.3 Foreground Small Objects Emphasis

ConsisDrive enhances the fidelity of small and challenging objects (e.g., pedestrians and bicycles).
We overlay 3D bounding box projections onto the generated videos.
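The foreground emphasis behind this result is what the abstract calls the Instance-Masked Loss: foreground regions are upweighted under a probabilistic instance mask so small objects are not drowned out by the background. The sketch below is our own illustrative version, not the paper's formula: the function name, the `fg_weight` multiplier, and the Bernoulli keep-probability `p_keep` are all hypothetical choices.

```python
import torch

def instance_masked_loss(pred, target, inst_mask, fg_weight=2.0, p_keep=0.5):
    """Hypothetical sketch of a probabilistically foreground-weighted loss.

    pred, target: (B, C, H, W) generated and ground-truth frames.
    inst_mask:    (B, 1, H, W) binary mask, 1 on foreground instances.
    fg_weight:    assumed extra weight applied to masked foreground pixels.
    p_keep:       assumed probability that a foreground pixel is upweighted,
                  so the emphasis is stochastic rather than deterministic.
    """
    per_pixel = (pred - target).pow(2)  # plain per-pixel MSE

    # Sample which foreground pixels receive extra weight this step.
    keep = (torch.rand_like(inst_mask) < p_keep).float()

    # Background pixels keep weight 1.0, preserving overall scene fidelity;
    # sampled foreground pixels are scaled up to fg_weight.
    weight = 1.0 + (fg_weight - 1.0) * inst_mask * keep
    return (per_pixel * weight).mean()
```

With an all-zero instance mask the weight map is uniformly 1 and the loss reduces to plain MSE, so the background term of the objective is unchanged by the masking.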