Abstract: We present Video Creation by Demonstration: given a demonstration video and an initial frame from any scene, we generate a realistic video that continues naturally from the initial frame and carries out the action concepts from the demonstration. This is important because, unlike captions, camera poses, or point tracks, a demonstration video can provide a detailed description of the target action without requiring extensive manual annotation. The main challenge in training such models is the difficulty of curating supervised training data with paired actions across different contexts. To mitigate this, we propose Delta-Diffusion, a self-supervised method that learns from unlabeled videos. Our key insight is that by placing a separately learned bottleneck on the features of a video foundation model, we can extract demonstration actions through these features while minimizing degenerate solutions. We find that Delta-Diffusion outperforms baselines in both human preference studies and large-scale machine evaluations.
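To make the core idea in the abstract concrete, here is a minimal, hedged sketch (not the authors' implementation) of self-supervised training with a learned bottleneck on frozen video foundation-model features, used together with an initial-frame latent to condition a toy denoiser. All class names, dimensions, the pooling design, and the linear noising schedule are illustrative assumptions rather than details from the paper.

```python
# Illustrative sketch only: a bottleneck compresses demonstration features so that
# (ideally) mostly action-level information conditions the generator, while the
# initial frame supplies scene appearance. Training is self-supervised: the
# demonstration and the reconstruction target are the same unlabeled clip.

import torch
import torch.nn as nn


class FeatureBottleneck(nn.Module):
    """Compress per-frame foundation-model features into a few low-dim tokens."""

    def __init__(self, feat_dim: int = 1024, num_tokens: int = 8, token_dim: int = 64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.down = nn.Linear(feat_dim, token_dim)  # the information bottleneck

    def forward(self, demo_feats: torch.Tensor) -> torch.Tensor:
        # demo_feats: (B, T, feat_dim) features of the demonstration video
        q = self.queries.unsqueeze(0).expand(demo_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, demo_feats, demo_feats)  # (B, num_tokens, feat_dim)
        return self.down(pooled)                          # (B, num_tokens, token_dim)


class ToyConditionalDenoiser(nn.Module):
    """Stand-in for a video diffusion backbone conditioned on frame + action tokens."""

    def __init__(self, latent_dim: int = 256, token_dim: int = 64):
        super().__init__()
        self.cond_proj = nn.Linear(token_dim, latent_dim)
        self.net = nn.Sequential(
            nn.Linear(latent_dim * 2, 512), nn.SiLU(), nn.Linear(512, latent_dim)
        )

    def forward(self, noisy_latents, first_frame_latent, action_tokens):
        # noisy_latents: (B, T, latent_dim); action_tokens: (B, K, token_dim)
        cond = self.cond_proj(action_tokens).mean(dim=1, keepdim=True)  # (B, 1, latent_dim)
        cond = cond + first_frame_latent.unsqueeze(1)                   # mix frame + action
        x = torch.cat([noisy_latents, cond.expand_as(noisy_latents)], dim=-1)
        return self.net(x)  # predicted noise


# One toy training step on random tensors standing in for real features/latents.
B, T, feat_dim, latent_dim = 2, 16, 1024, 256
demo_feats = torch.randn(B, T, feat_dim)      # frozen foundation-model features
clip_latents = torch.randn(B, T, latent_dim)  # latents of the clip to reconstruct
first_frame_latent = clip_latents[:, 0]

bottleneck = FeatureBottleneck(feat_dim)
denoiser = ToyConditionalDenoiser(latent_dim)

t = torch.rand(B, 1, 1)                       # diffusion time in [0, 1]
noise = torch.randn_like(clip_latents)
noisy = (1 - t) * clip_latents + t * noise    # simple linear noising for illustration

pred = denoiser(noisy, first_frame_latent, bottleneck(demo_feats))
loss = nn.functional.mse_loss(pred, noise)
loss.backward()
print(f"toy training loss: {loss.item():.4f}")
```

The narrow `token_dim` projection is what plays the role of the bottleneck here: it limits how much of the demonstration video can pass through, which is one plausible way to discourage the degenerate solution of copying the demonstration wholesale.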
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Wilka_Torrico_Carvalho1
Submission Number: 5967