Corruption-Aware Training of Latent Video Diffusion Models for Robust Text-to-Video Generation

Published: 02 Mar 2026, Last Modified: 29 Mar 2026 · ReALM-GEN 2026 - ICLR 2026 Workshop · CC BY 4.0
Keywords: text-video diffusion, corruption-aware training, structured noise injection, multimodal robustness, temporal coherence
Abstract: Latent Video Diffusion Models (LVDMs) have achieved state-of-the-art generative quality for image and video generation; however, they remain brittle under noisy conditioning, where small perturbations in text or multimodal embeddings can cascade over timesteps and cause semantic drift. Existing corruption strategies from image diffusion (Gaussian, Uniform) fail in video settings because static noise disrupts temporal fidelity. In this paper, we propose CAT-Video, a corruption-aware training framework with structured, data-aligned noise injection tailored for video diffusion. Our two operators, Batch-Centered Noise Injection (BCNI) and Spectrum-Aware Contextual Noise (SACN), align perturbations with batch semantics or spectral dynamics to preserve coherence. CAT-Video yields substantial gains: BCNI reduces FVD by 31.9% on WebVid-2M, MSR-VTT, and MSVD, while SACN improves UCF-101 by 12.3%, outperforming Gaussian, Uniform, and even large diffusion baselines such as DEMO (2.3B) and LaVie (3B) despite training on 5× less data. Ablations confirm the unique value of low-rank, data-aligned noise, and our theory establishes why these operators tighten robustness and generalization bounds. CAT-Video thus establishes a new framework for robust video diffusion, and our experiments show that it also extends to autoregressive generation and multimodal video-understanding LLMs.
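The abstract describes BCNI only at a high level (perturbations aligned with batch semantics rather than isotropic noise); the paper's exact formulation is not given here. The following is a minimal illustrative sketch under that assumption: each conditioning embedding is perturbed along its own deviation from the batch mean, a data-aligned, low-rank corruption, as opposed to adding i.i.d. Gaussian noise. The function name and parameters are hypothetical, not the authors' API.

```python
import numpy as np

def batch_centered_noise(embeddings, sigma=0.1, rng=None):
    """Hypothetical sketch of Batch-Centered Noise Injection (BCNI).

    Perturbs each conditioning embedding along its deviation from the
    batch mean, so the corruption stays aligned with batch semantics
    instead of being isotropic Gaussian noise. `sigma` controls the
    corruption strength; the exact schedule in CAT-Video may differ.
    """
    rng = np.random.default_rng() if rng is None else rng
    center = embeddings.mean(axis=0, keepdims=True)    # batch semantic center
    directions = embeddings - center                   # data-aligned directions
    scale = sigma * rng.standard_normal((embeddings.shape[0], 1))
    return embeddings + scale * directions             # rank-1 perturbation per sample

# Usage: corrupt a batch of 8 text embeddings of dimension 512.
emb = np.random.default_rng(0).standard_normal((8, 512))
noisy = batch_centered_noise(emb, sigma=0.1, rng=np.random.default_rng(1))
```

By construction, each sample's perturbation is collinear with its batch-centered direction, which is one way to read the abstract's claim that the noise "aligns with batch semantics"; an SACN-style operator would instead modulate noise by the spectral content of the sequence.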
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 14