Flowing From Observed To Future Frames For Efficient Video Prediction

17 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: video prediction, flow matching, inherent optimal couplings, target inversion, video prediction via generative models
TL;DR: We introduce a fast and memory-efficient video prediction method that flows directly from observed to future frames.
Abstract: This paper introduces a novel methodology for fast and memory-efficient video prediction. Our method, dubbed FlowFrames, fine-tunes a pre-trained text-to-video flow model to learn a vector field between the observed and future frame distributions. Two design choices are key. First, we introduce inherent optimal couplings, using consecutive video chunks during training as a practical proxy for optimal couplings, which yields straighter flows. Second, we incorporate target inversion, injecting the inverted latent of the target chunk into the input representation to strengthen correspondences and improve visual fidelity. By flowing directly from observed to future frames, rather than concatenating input frames with noise to generate future frames, we halve the dimensionality of the model input. The proposed method, fine-tuned from LTXV and Wan, surpasses state-of-the-art scores on quantitative evaluations with FID and FVD using as few as five neural function evaluations. We will release the code and models to the public.
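The abstract describes a flow-matching objective whose endpoints are both data: the observed chunk and the future chunk, rather than noise and data. The sketch below is a minimal reconstruction of that idea under stated assumptions, not the authors' released code: `model`, `obs_latent`, `future_latent`, and `cond` are hypothetical names, and the target-inversion component is omitted. It shows a linear-path flow-matching loss with consecutive chunks as the coupling, plus a few-step Euler sampler consistent with the five-NFE claim.

```python
import torch


def flow_matching_loss(model, obs_latent, future_latent, cond=None):
    """One training step of flow matching from observed to future frames.

    obs_latent, future_latent: latents of two consecutive chunks of the
    same video (a proxy for the paper's "inherent optimal couplings").
    model(x_t, t, cond) is a hypothetical velocity-field predictor.
    """
    b = obs_latent.shape[0]
    t = torch.rand(b, device=obs_latent.device)          # t ~ U[0, 1]
    t_ = t.view(b, *([1] * (obs_latent.dim() - 1)))      # broadcast over latent dims
    x_t = (1.0 - t_) * obs_latent + t_ * future_latent   # linear interpolation path
    v_target = future_latent - obs_latent                # constant velocity along the path
    v_pred = model(x_t, t, cond)
    return torch.mean((v_pred - v_target) ** 2)


@torch.no_grad()
def predict_future(model, obs_latent, cond=None, nfe=5):
    """Few-step Euler integration from the observed-chunk latent toward a
    predicted future-chunk latent; straighter flows tolerate small nfe."""
    x = obs_latent.clone()
    dt = 1.0 / nfe
    for i in range(nfe):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * model(x, t, cond)
    return x
```

Note that because both endpoints are data, no noise tensor is concatenated to the conditioning frames, which is what halves the model's input dimensionality relative to the common noise-plus-context formulation.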
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9465