Video Unlearning via Low-Rank Refusal Vector

Published: 03 Mar 2026, Last Modified: 07 Apr 2026ICLR 2026 DeLTa Workshop PosterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: video, generation, safety, unlearning, text-to-video, diffusion, wan, sora, opensora, refusal, steering
TL;DR: We introduce a training-free weight update for video diffusion models that permanently removes unsafe concepts using refusal vectors, achieving effective censorship without retraining or degrading video quality.
Abstract: Video generative models achieve high-quality synthesis from natural-language prompts by leveraging large-scale web data. However, this training paradigm inherently exposes them to unsafe biases and harmful concepts, introducing the risk of generating undesirable or illicit content. To mitigate unsafe generations, existing machine unlearning approaches either rely on filtering, and can therefore be bypassed, or they update model weights, but with costly fine-tuning or training-free closed-form edits. We propose the first training-free weight update framework for concept removal in video diffusion models. From five paired safe/unsafe prompts, our method estimates a refusal vector and integrates it into the model weights as a closed-form update. A contrastive low-rank factorization further disentangles the target concept from unrelated semantics, it ensures a selective concept suppression and it does not harm generation quality. Our approach reduces unsafe generations on the Open-Sora and ZeroScopeT2V models across the T2VSafetyBench and SafeSora benchmarks, with average reductions of 36.3\% and 58.2\% respectively, while preserving prompt alignment and video quality. This establishes an efficient and scalable solution for safe video generation without retraining nor any inference overhead.
Submission Number: 74
Loading