Test-Time Domain Adaptation for Interactive Video Generation

Published: 24 Mar 2026, Last Modified: 24 Mar 2026 · CVPR 2026 Workshop VGBE · CC BY 4.0
Submission Type: Short Papers (up to 4 pages)
Supplementary Material: zip
Keywords: video generation, diffusion, plug-and-play, domain adaptation, interactive video generation, inference-time alignment
TL;DR: A training-free interactive video generation method that introduces Mask Normalization and Temporal Intrinsic Denoising to resolve the covariate shift and initialization gap induced by masked attention, enabling precise trajectory control
Abstract: Text-conditioned diffusion models have emerged as powerful tools for video synthesis, yet enabling Interactive Video Generation (IVG), where users explicitly control object trajectories, remains challenging. While recent training-free approaches use attention masking for guidance, they often trade off perceptual quality for control. In this work, we identify the root causes of this degradation as two distinct domain shifts: 1) an internal covariate shift induced by applying masks to pretrained models, and 2) an initialization gap, where random noise lacks alignment with the trajectory conditions. We propose a test-time domain adaptation framework to resolve both shifts. We first introduce Mask Normalization, a pre-normalization layer that mitigates the covariate shift (1) by aligning feature distributions. We then introduce a Temporal Intrinsic Prior that enforces spatio-temporal consistency during denoising, bridging the initialization gap (2). Extensive evaluations on a popular dataset demonstrate that our approach outperforms state-of-the-art IVG methods in both perceptual quality and trajectory adherence.
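
The abstract names the two components but not their form. Purely as an illustrative sketch, the PyTorch snippet below shows one plausible reading of each. Everything here is an assumption: the function names masked_attention_with_mask_norm and trajectory_correlated_noise, the choice to re-align post-attention statistics against an unmasked reference pass, and the rolled-noise construction of a trajectory-aligned initialization are hypothetical stand-ins, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def masked_attention_with_mask_norm(q, k, v, mask, eps=1e-6):
    # q, k, v: (batch, heads, tokens, dim); mask: bool (tokens, tokens),
    # True where attention is allowed (e.g., inside the user's trajectory region).
    scale = q.shape[-1] ** -0.5
    logits = q @ k.transpose(-2, -1) * scale

    # Reference path: the unmasked attention the pretrained model was trained on.
    ref = F.softmax(logits, dim=-1) @ v

    # Masked path: suppress attention outside the allowed region for control.
    out = F.softmax(logits.masked_fill(~mask, float("-inf")), dim=-1) @ v

    # "Mask Normalization" (hypothetical placement): re-align per-channel
    # statistics of the masked output to the reference output, countering
    # the covariate shift the mask induces in the pretrained features.
    mu_o, sd_o = out.mean(-2, keepdim=True), out.std(-2, keepdim=True)
    mu_r, sd_r = ref.mean(-2, keepdim=True), ref.std(-2, keepdim=True)
    return (out - mu_o) / (sd_o + eps) * sd_r + mu_r

def trajectory_correlated_noise(c, h, w, trajectory):
    # trajectory: list of per-frame (dy, dx) offsets. One shared noise patch is
    # rolled frame-by-frame so corresponding pixels stay correlated over time,
    # giving the sampler an initialization already aligned with the trajectory.
    base = torch.randn(c, h, w)
    return torch.stack([torch.roll(base, shifts=(dy, dx), dims=(1, 2))
                        for dy, dx in trajectory])

# Toy usage: keep the diagonal unmasked so no query row is fully masked out.
b, hds, n, d = 1, 4, 16, 32
q, k, v = (torch.randn(b, hds, n, d) for _ in range(3))
mask = torch.eye(n, dtype=torch.bool) | (torch.rand(n, n) > 0.5)
feats = masked_attention_with_mask_norm(q, k, v, mask)
init = trajectory_correlated_noise(4, 32, 32, [(t, 2 * t) for t in range(8)])
```

In this reading, the first function counters the covariate shift by matching the masked output's per-channel mean and variance to the distribution the pretrained model expects, while the second seeds denoising with noise whose temporal correlations already follow the requested trajectory, one plausible way to narrow the initialization gap.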
Submission Number: 14