Abstract: Video frame interpolation is a critical component of video streaming, a vibrant research area addressing the needs of both service providers and users. However, most existing methods cannot handle changing video resolutions while improving user perceptual quality. We aim to unleash the multifaceted knowledge yielded by the hierarchical views at multiple scales in a pyramid network. Specifically, we build a dual-view pyramid network by introducing pyramidal dual-view correspondence matching. It compels each scale to actively seek knowledge in view of both the current scale and a coarser scale, conducting robust correspondence matching by considering neighboring scales. Meanwhile, an auxiliary multi-scale collaborative supervision is devised to enforce the exchange of knowledge between the current scale and a finer scale and thus reduce error propagation from coarse to fine scales. Based on the robust video dynamics captured by pyramidal dual-view correspondence matching, we further develop a pyramidal refinement module that formulates frame refinement as progressive latent representation generation by developing flow-guided cross-scale attention for feature fusion among neighboring frames. The proposed method achieves favorable performance on several benchmarks of varying video resolutions with better user perceptual quality and a relatively compact model size.
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: This paper is relevant to multimedia and multimodal processing in two respects. First, this paper focuses on video frame interpolation, a critical video processing technique with various applications in multimedia systems such as video compression, video-on-demand streaming, and video editing. With existing techniques, user experience in these applications is still challenged in terms of both efficiency and efficacy due to computationally intensive motion analysis and compensation, especially for varying video resolutions and complex scene dynamics. Second, the proposed dual-view pyramid network considers the multimodality of video data, which involves interactions among spatial and temporal scene dynamics across multiple scales. Multimodality offers a valuable approach for analyzing spatially and temporally changing views and actions across video frames. Effective multimodal fusion, achieved by introducing pyramidal dual-view correspondence matching and flow-guided cross-scale attention along with auxiliary multi-scale collaborative supervision, enables the proposed method to provide a more compact solution that achieves better user perceptual quality, and thus has the potential to improve the multimedia streaming experience for both users and service providers.
Submission Number: 4636