Keywords: video diffusion, corruption-aware training, robust video generation, structured noise injection, multimodal robustness, temporal coherence
TL;DR: We introduce CAT-Video, a corruption-aware training framework that improves robustness and temporal coherence in video diffusion models through structured, data-aligned noise injection.
Abstract: Latent Video Diffusion Models (LVDMs) have achieved state-of-the-art generative quality for image and video generation; however, they remain brittle under noisy conditioning, where small perturbations in text or multimodal embeddings can cascade over timesteps and cause semantic drift. Existing corruption strategies from image diffusion (Gaussian, uniform) fail in video settings because static noise disrupts temporal fidelity. In this paper, we propose **CAT-Video**, a corruption-aware training framework with structured, data-aligned noise injection tailored for video diffusion. Our two operators—*Batch-Centered Noise Injection (BCNI)* and *Spectrum-Aware Contextual Noise (SACN)*—align perturbations with batch semantics or spectral dynamics to preserve coherence. CAT-Video yields substantial gains: BCNI reduces FVD by **31.9%** on WebVid-2M, MSR-VTT, and MSVD, while SACN improves UCF-101 by **12.3%**, outperforming Gaussian, uniform, and even large diffusion baselines such as DEMO (2.3B) and LaVie (3B) despite training on $\mathbf{5}\times$ less data. Ablations confirm the unique value of low-rank, data-aligned noise, and our theory establishes why these operators tighten robustness and generalization bounds. CAT-Video thus provides a new framework for robust video diffusion, and our experiments show that it also extends to autoregressive generation and multimodal video understanding LLMs.
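To make "data-aligned noise injection" concrete, here is a minimal, hypothetical sketch of a batch-centered corruption in the spirit of BCNI: instead of adding isotropic Gaussian noise to conditioning embeddings, each embedding is perturbed along its own offset from the batch mean, so the corruption stays within the batch's semantic subspace. The function name, the scaling scheme, and the use of the batch mean as the semantic center are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def batch_centered_noise(embeddings, sigma=0.1, rng=None):
    """Illustrative batch-centered noise injection (assumed form, not the
    paper's exact BCNI operator).

    embeddings: (batch, dim) array of conditioning embeddings.
    sigma: corruption strength; sigma=0 returns the input unchanged.

    Each sample is perturbed along its offset from the batch mean, so the
    noise is low-rank and aligned with batch semantics rather than isotropic.
    """
    rng = np.random.default_rng(rng)
    center = embeddings.mean(axis=0, keepdims=True)  # batch semantic center
    offsets = embeddings - center                    # data-aligned directions
    # One random scalar per sample scales its own semantic offset.
    scale = sigma * rng.standard_normal((embeddings.shape[0], 1))
    return embeddings + scale * offsets
```

In a corruption-aware training loop, such a function would be applied to the text or multimodal embeddings before they condition the denoiser, teaching the model to tolerate semantically plausible perturbations.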
Supplementary Material: zip
Primary Area: generative models
Submission Number: 19699