Abstract: Most video frame interpolation (VFI) algorithms infer the intermediate frame from adjacent frames through cascaded motion estimation and content refinement. However, the intrinsic correlations between motion and content are rarely investigated, which commonly produces interpolated results with temporal inconsistency and blurry content. In this work, we first identify a simple yet essential piece of domain knowledge: the content and motion characteristics of the same object should be homogeneous to a certain degree, and we formulate this consistency into a loss function for model optimization. Building on this, we propose to learn a collaborative representation between motion and content, and construct a novel progressive spatial-temporal collaborative network (Prost-Net) for video frame interpolation. Specifically, we develop a content-guided motion module (CGMM) and a motion-guided content module (MGCM) for the individual motion and content representations: the motion predicted by the CGMM guides the fusion and distillation of content for intermediate frame interpolation, and vice versa. Furthermore, by applying this collaborative strategy within a multi-scale framework, Prost-Net progressively optimizes motion and content in a coarse-to-fine manner, making it robust to challenging VFI scenarios such as occlusion and large motion. Extensive experiments on benchmark datasets demonstrate that our method significantly outperforms state-of-the-art methods.
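The abstract does not include implementation details, but the collaborative mechanism it describes can be illustrated schematically. Below is a minimal PyTorch sketch of one pyramid level in which a content-guided motion module refines the flow using content features, and a motion-guided content module fuses warped content features under guidance of that flow. All module names, layer choices, channel widths, and the bidirectional warping scheme here are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def warp(x, flow):
    """Backward-warp feature map x (N, C, H, W) with flow (N, 2, H, W)."""
    n, _, h, w = x.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=x.device, dtype=x.dtype),
        torch.arange(w, device=x.device, dtype=x.dtype),
        indexing="ij",
    )
    # Shift the sampling grid by the flow and normalize to [-1, 1].
    gx = 2.0 * (xs + flow[:, 0]) / max(w - 1, 1) - 1.0
    gy = 2.0 * (ys + flow[:, 1]) / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)  # (N, H, W, 2)
    return F.grid_sample(x, grid, align_corners=True)


class CGMM(nn.Module):
    """Hypothetical content-guided motion module: content features of
    both neighbor frames guide a residual refinement of the flow."""
    def __init__(self, ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * ch + 2, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 2, 3, padding=1),
        )

    def forward(self, feat0, feat1, flow):
        return flow + self.net(torch.cat([feat0, feat1, flow], dim=1))


class MGCM(nn.Module):
    """Hypothetical motion-guided content module: the current flow guides
    the fusion of content features warped toward the intermediate time."""
    def __init__(self, ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * ch + 2, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, feat0, feat1, flow):
        w0 = warp(feat0, -0.5 * flow)  # warp frame-0 features to time 0.5
        w1 = warp(feat1, 0.5 * flow)   # warp frame-1 features to time 0.5
        return self.net(torch.cat([w0, w1, flow], dim=1))


class CollaborativeLevel(nn.Module):
    """One scale of a coarse-to-fine pyramid: motion and content
    are refined in turn, each guiding the other."""
    def __init__(self, ch):
        super().__init__()
        self.cgmm = CGMM(ch)
        self.mgcm = MGCM(ch)

    def forward(self, feat0, feat1, flow):
        flow = self.cgmm(feat0, feat1, flow)     # content guides motion
        content = self.mgcm(feat0, feat1, flow)  # motion guides content
        return flow, content


if __name__ == "__main__":
    level = CollaborativeLevel(32)
    f0, f1 = torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64)
    flow, content = level(f0, f1, torch.zeros(1, 2, 64, 64))
    print(flow.shape, content.shape)  # (1, 2, 64, 64), (1, 32, 64, 64)
```

In a full coarse-to-fine pipeline, such a level would be applied at each scale with the flow upsampled from the coarser level, which is what allows errors from large motions to be corrected progressively rather than in a single pass.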