Causally Steered Diffusion for Video Counterfactual Generation

ICLR 2026 Conference Submission 20489 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: counterfactual generation, causality, generative AI, diffusion models, VLMs
TL;DR: A causal framework for counterfactual video generation, guided by a vision-language model (VLM)
Abstract: Adapting text-to-image (T2I) latent diffusion models (LDMs) to video editing has shown strong visual fidelity and controllability, but challenges remain in maintaining the causal relationships inherent in the video data-generating process. In this work, we propose CSVC, a framework for counterfactual video generation grounded in structural causal models (SCMs) and formulated as an out-of-distribution (OOD) prediction task. CSVC builds on black-box counterfactual functions, which approximate SCM mechanisms without explicit structural equations. In our framework, large language models (LLMs) generate counterfactual prompts that are consistent with a predefined causal graph, while LDM-based video editors produce the corresponding video counterfactuals. To ensure faithful interventions, we introduce a vision–language model (VLM)-based textual loss that refines prompts to enforce counterfactual conditioning, steering the LDM latent space toward causally meaningful OOD variations without internal model access or fine-tuning. Experiments on real-world facial videos show that CSVC achieves state-of-the-art causal effectiveness while preserving temporal consistency and visual quality. By combining SCM reasoning with black-box generative models, CSVC enables realistic “what if” video scenarios with applications in digital media and healthcare.
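A minimal sketch of the black-box, VLM-guided prompt refinement loop the abstract describes, under stated assumptions: the LLM proposer, the VLM scorer, and the LDM video editor are all hypothetical placeholders supplied by the caller (the paper's actual textual loss and optimization procedure are not reproduced here).

```python
from typing import Callable, Sequence


def select_counterfactual_prompt(
    factual_prompt: str,
    intervention: str,
    propose_prompts: Callable[[str, str], Sequence[str]],  # LLM: causal-graph-consistent rewrites (hypothetical)
    vlm_score: Callable[[str, str], float],                # VLM: alignment of a prompt with the intervention, in [0, 1] (hypothetical)
) -> str:
    """Pick the candidate counterfactual prompt with the lowest VLM-based textual loss.

    Both the LLM and the VLM are treated as black boxes: no gradients,
    no access to the video editor's internals, no fine-tuning.
    """
    candidates = propose_prompts(factual_prompt, intervention)
    if not candidates:
        raise ValueError("LLM returned no counterfactual prompt candidates")
    # Lower loss means the prompt more faithfully encodes the requested intervention.
    losses = [1.0 - vlm_score(p, intervention) for p in candidates]
    best = min(range(len(candidates)), key=losses.__getitem__)
    return candidates[best]


# Usage sketch (video_editor stands in for an LDM-based video editor):
# prompt_cf = select_counterfactual_prompt(
#     "a young person smiling", "do(age := elderly)", llm_propose, vlm_alignment)
# edited_video = video_editor(source_video, prompt_cf)
```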
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20489