Subject-driven Video Generation Emerges from Experience Replays

17 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: video generation, customization, personalization, diffusion models, continual learning
TL;DR: We employ continual learning with video replay and adjust the replay ratio dynamically, achieving on-par subject fidelity and motion with lower compute than state-of-the-art models.
Abstract: We aim to enable efficient subject-to-video (S2V) learning, which otherwise requires expensive video–subject pair datasets and tens of thousands of GPU hours of training. Using image-paired datasets to train video models could address this cost, but naively training on image pairs causes a catastrophic loss of temporal ability due to gradient conflicts. We hypothesize that S2V generation decomposes into two orthogonal objectives: identity learning from images and temporal-dynamics learning from videos. Based on this orthogonality assumption, we design a stochastic task-switching strategy that predominantly samples from image datasets while maintaining minimal video replay for temporal coherence. Our experiments validate this hypothesis by demonstrating that the gradient inner product between the two tasks converges exponentially to near zero, confirming emergent orthogonalization without requiring explicit orthogonal projection. This validated orthogonality enables efficient image-dominant training while preventing catastrophic forgetting through proxy experience replay. We further employ regularization techniques, including random frame selection and token dropping during video replay, to ensure efficient temporal learning. Extensive experiments demonstrate that our approach achieves superior performance at compute comparable to per-subject tuned methods for single subjects, while providing zero-shot capability and outperforming both per-subject tuned methods and some existing zero-shot approaches.
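The abstract describes two mechanisms that lend themselves to a brief illustration: a stochastic task-switching sampler that mostly draws image-identity batches with occasional video replay, and a gradient inner-product check used to monitor emergent task orthogonality. Below is a minimal Python sketch of both ideas; the function names (`make_task_switching_sampler`, `gradient_cosine`), the default `replay_ratio`, and the use of cosine similarity rather than a raw inner product are illustrative assumptions, not the paper's actual implementation.

```python
import itertools
import random

import torch
import torch.nn.functional as F


def make_task_switching_sampler(image_batches, video_batches,
                                replay_ratio=0.1, seed=0):
    """Yield (task, batch) pairs: mostly image-identity batches, with
    video-replay batches interleaved at the given ratio.

    Finite iterables are cycled so the generator never exhausts;
    `replay_ratio` is a hypothetical knob standing in for the paper's
    dynamically adjusted replay schedule.
    """
    rng = random.Random(seed)
    image_iter = itertools.cycle(image_batches)
    video_iter = itertools.cycle(video_batches)
    while True:
        if rng.random() < replay_ratio:
            yield "video_replay", next(video_iter)
        else:
            yield "image_identity", next(image_iter)


def gradient_cosine(model, image_loss, video_loss):
    """Cosine similarity between the flattened gradients of the two task
    losses; values near zero would indicate the (emergent) orthogonality
    the abstract refers to."""
    flat = []
    for loss in (image_loss, video_loss):
        grads = torch.autograd.grad(loss, model.parameters(),
                                    retain_graph=True, allow_unused=True)
        flat.append(torch.cat([g.flatten() for g in grads if g is not None]))
    return F.cosine_similarity(flat[0], flat[1], dim=0)
```

In a training loop, one would draw the next `(task, batch)` pair from the sampler at each step and periodically log `gradient_cosine` on held-out image and video batches to track whether the two objectives stay near-orthogonal as training proceeds.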
Supplementary Material: zip
Primary Area: generative models
Submission Number: 8758