Motion-aware Latent Diffusion Models for Video Frame Interpolation

Published: 20 Jul 2024, Last Modified: 04 Aug 2024 · MM 2024 Poster · CC BY 4.0
Abstract: With the advancement of AIGC, video frame interpolation (VFI) has become a crucial component in existing video generation frameworks, attracting widespread research interest. For the VFI task, motion estimation between neighboring frames plays a crucial role in avoiding motion ambiguity. However, existing VFI methods often struggle to accurately predict the motion between consecutive frames, and this imprecise estimation leads to blurred and visually incoherent interpolated frames. In this paper, we propose a novel diffusion framework, motion-aware latent diffusion models (MADiff), specifically designed for the VFI task. By incorporating motion priors between the conditioning neighboring frames and the target frame predicted throughout the diffusion sampling procedure, MADiff progressively refines the intermediate outcomes, culminating in interpolated frames that are both visually smooth and realistic. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance, significantly outperforming existing approaches, especially in challenging scenarios involving dynamic textures with complex motion.
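To make the sampling idea in the abstract concrete, below is a minimal, hypothetical sketch of a motion-guided latent diffusion sampler for the middle frame. The encoder, decoder, noise-predicting denoiser, and flow estimator are assumed callables, and the halfway-warp blending rule is an illustrative stand-in for the motion prior, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def backward_warp(x, flow):
    """Warp tensor x (B, C, H, W) by a dense flow field (B, 2, H, W)."""
    b, _, h, w = x.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=x.device, dtype=x.dtype),
        torch.arange(w, device=x.device, dtype=x.dtype), indexing="ij")
    grid = torch.stack((xs, ys))[None] + flow                 # pixel coordinates
    gx = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0               # normalize to [-1, 1]
    gy = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(x, torch.stack((gx, gy), dim=-1), align_corners=True)


@torch.no_grad()
def sample_middle_frame(encoder, decoder, denoiser, flow_net,
                        frame0, frame1, alphas_cumprod, guidance=0.3):
    """DDIM-style sampling of the latent for the frame between frame0 and frame1,
    with each step nudged toward a motion-consistent target (illustrative only)."""
    z0, z1 = encoder(frame0), encoder(frame1)                 # conditioning latents
    flow = flow_net(frame0, frame1)                           # motion prior at latent resolution (assumed)
    z_t = torch.randn_like(z0)                                # start from pure noise

    steps = list(range(len(alphas_cumprod) - 1, -1, -1))
    for i, t in enumerate(steps):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[steps[i + 1]] if i + 1 < len(steps) else z_t.new_tensor(1.0)

        eps = denoiser(z_t, t, cond=(z0, z1, flow))           # predicted noise
        x0_hat = (z_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # implied clean latent

        # Motion-aware correction: blend with the neighbors warped halfway
        # along the estimated flow so the prediction stays motion-consistent.
        x0_motion = 0.5 * (backward_warp(z0, 0.5 * flow) +
                           backward_warp(z1, -0.5 * flow))
        x0_hat = (1 - guidance) * x0_hat + guidance * x0_motion

        # Deterministic DDIM update toward the next (less noisy) timestep.
        eps = (z_t - a_t.sqrt() * x0_hat) / (1 - a_t).sqrt()
        z_t = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps

    return decoder(z_t)
```

The blending weight `guidance` trades off fidelity to the denoiser's prediction against consistency with the warped neighbors; the names and the specific correction rule are assumptions made for this sketch.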
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: This work contributes to multimedia and multimodal processing by advancing video frame interpolation (VFI), a critical component in applications that involve the manipulation and understanding of visual data. For example, this work can: (1) improve video quality and thereby enhance downstream video analytics; (2) be leveraged in video compression algorithms to reduce the need to store every single frame, saving storage space and bandwidth while maintaining perceived visual quality; (3) reduce latency and improve the user experience in AR and VR by synthesizing additional frames to bridge the gap between real-time captured frames.
Supplementary Material: zip
Submission Number: 603