Realtime Video Frame Interpolation using One-Step Diffusion Sampling

ICLR 2026 Conference Submission324 Authors

01 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · License: CC BY 4.0
Keywords: Video Frame Interpolation; Diffusion Models; Realtime Processing
Abstract: Recent research on Video Frame Interpolation (VFI) shows that a pretrained Video Diffusion Model (VDM) can handle many challenging scenarios, including large or complex motion. However, VDMs require iterative diffusion sampling, making inference slow. One way to accelerate inference is to distill a multi-step model into a one-step model, but distillation often introduces additional modules that significantly increase training overhead. Instead, we propose a Real-time Diffusion-based Video Frame Interpolation pipeline, \method. \method achieves efficient interpolation by disentangling the task into two subproblems: motion generation and appearance generation. In the first step, \method estimates a continuous motion field describing pixel movement across frames, using only a few sparse key frames. As a result, \method runs the diffusion model only on these sparse key frames rather than on every intermediate frame, substantially reducing the one-step training cost. In the second, appearance-estimation step, \method creates intermediate frames simply by warping the input frames with optical flows sampled from the continuous motion field estimated in the first step. Because our diffusion model generates motion only, it can operate at a fixed, relatively small resolution, yielding superior training and inference efficiency. Extensive experiments show that \method achieves interpolation quality comparable or superior to existing multi-step solutions. It also offers outstanding inference efficiency, interpolating at 17 FPS at $1024\times 576$ resolution, a \textbf{50$\times$ acceleration} over the fastest diffusion-based generation by Wan.
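The second, appearance-estimation step described in the abstract amounts to warping the two input frames with flows sampled from the motion field at the target time and blending the results. The following is a minimal NumPy sketch of that idea, not the paper's implementation: it assumes the continuous motion field reduces to linearly time-scaled bidirectional flows (`flow_0to1`, `flow_1to0` are hypothetical names) and uses plain bilinear backward warping with distance-weighted blending.

```python
import numpy as np

def backward_warp(frame, flow):
    """Bilinearly sample `frame` at positions displaced by `flow`.
    frame: (H, W, C) float array; flow: (H, W, 2) holding (dx, dy)."""
    H, W, _ = frame.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    # Source coordinates for each output pixel, clamped to the image.
    x_src = np.clip(xs + flow[..., 0], 0, W - 1)
    y_src = np.clip(ys + flow[..., 1], 0, H - 1)
    x0 = np.floor(x_src).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    y0 = np.floor(y_src).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    wx = (x_src - x0)[..., None]; wy = (y_src - y0)[..., None]
    top = frame[y0, x0] * (1 - wx) + frame[y0, x1] * wx
    bot = frame[y1, x0] * (1 - wx) + frame[y1, x1] * wx
    return top * (1 - wy) + bot * wy

def interpolate_frame(frame0, frame1, flow_0to1, flow_1to0, t):
    """Create the intermediate frame at time t in (0, 1):
    warp both inputs with time-scaled flows, blend by temporal distance.
    (Illustrative only; the paper samples flows from a learned
    continuous motion field rather than scaling them linearly.)"""
    warped0 = backward_warp(frame0, t * flow_0to1)
    warped1 = backward_warp(frame1, (1 - t) * flow_1to0)
    return (1 - t) * warped0 + t * warped1
```

Because only the small, fixed-resolution motion model involves diffusion sampling, each intermediate frame costs just this warp-and-blend, which is what makes real-time rates at $1024\times 576$ plausible.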
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 324