Harmony Everything! Masked Autoencoders for Video Harmonization

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Video harmonization aims to address the discrepancy in color and lighting between the foreground and background elements of video composites, thereby enhancing the coherence of composite video content. Nevertheless, existing methods struggle to handle video composites with excessively large foregrounds. In this paper, we propose Video Harmonization Masked Autoencoders (VHMAE), a simple yet powerful end-to-end video harmonization method designed to tackle this challenge. Unlike typical MAE-based methods that employ random or tube masking strategies, we treat all foreground regions requiring harmonization in each frame as prediction regions: they are designated as masked tokens and fed into our network to produce the final refined video. The network is thus optimized to prioritize the harmonization task, reconstructing the masked regions proficiently despite the limited background information. Specifically, we introduce the Pattern Alignment Module (PAM) to extract content information from the extensive masked foreground region, aligning the latent semantic features of the masked foreground content with the background context while disregarding variations in color or illumination. Moreover, we propose the Patch Balancing Loss, which effectively mitigates the grid-like artifacts commonly observed in MAE-based approaches to image generation, thereby ensuring consistency between the predicted foreground and the visible background. Additionally, we introduce a real-composited video harmonization dataset named RCVH, which serves as a valuable benchmark for assessing video harmonization techniques across different real video sources. Comprehensive experiments demonstrate that our VHMAE outperforms state-of-the-art techniques on both our RCVH and the publicly available HYouTube dataset.
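The foreground-as-masked-tokens idea described above can be illustrated with a short, hypothetical sketch (not the authors' implementation): each frame's binary foreground mask is downsampled to a ViT-style patch grid, so every patch overlapping the composited foreground becomes a masked token while background patches remain visible. The function name, coverage threshold, and shapes below are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def foreground_patch_masking(fg_mask, patch_size=16, threshold=0.5):
    """Turn per-frame binary foreground masks into patch-level token masks.

    fg_mask: (T, 1, H, W) tensor in {0, 1}, where 1 marks the composited foreground.
    Returns a boolean tensor (T, num_patches); True marks a patch treated as a
    masked (to-be-harmonized) token, False a visible background token.
    """
    # Fraction of foreground pixels covered by each patch.
    patch_cover = F.avg_pool2d(fg_mask.float(), kernel_size=patch_size,
                               stride=patch_size)            # (T, 1, H/p, W/p)
    token_is_masked = (patch_cover > threshold).flatten(1)   # (T, num_patches)
    return token_is_masked

# Usage: as in a standard MAE, only visible (background) tokens would be fed to
# the encoder, and the decoder would predict the masked foreground tokens.
T, H, W = 8, 224, 224
fg_mask = torch.zeros(T, 1, H, W)
fg_mask[:, :, 64:192, 64:192] = 1.0          # a large square foreground region
mask = foreground_patch_masking(fg_mask)     # (8, 196) boolean
print(mask.sum(dim=1))                       # number of masked tokens per frame
```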
Primary Subject Area: [Content] Media Interpretation
Relevance To Conference: The prevalence of fast-paced multimedia platforms like Meta and TikTok has sparked significant interest in video editing, particularly in the fundamental task of video composition, which aims to integrate two unrelated videos seamlessly. However, variations in shooting environments or equipment can lead to discrepancies in color and lighting between the two videos, so the resulting composite may appear unrealistic when their contents are merged. Video harmonization addresses this issue by adjusting the foreground so that it blends seamlessly with the background, yielding more realistic and pleasing composites. In this paper, we propose Video Harmonization Masked Autoencoders (VHMAE), a simple yet powerful end-to-end video harmonization method. VHMAE is specifically engineered to address the complexities of video composition tasks, particularly those involving excessively large inharmonious foregrounds. Our main contributions can be summarized as follows:
· We introduce a practical setting for video harmonization with large foregrounds and propose VHMAE, which, to the best of our knowledge, is the first end-to-end MAE-based model for video harmonization.
· We devise two key modules for VHMAE: the Pattern Alignment Module (PAM), which aligns semantic information between foreground and background to prevent disharmony, and the Patch Balancing Loss, which reduces grid-like artifacts in the output caused by splitting frames into patches (see the sketch after this list).
· We present a new and practical real-composited video harmonization dataset called RCVH. Extensive experiments on several benchmarks demonstrate the effectiveness and superior performance of our VHMAE.
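The exact formulation of the Patch Balancing Loss is not given in this summary; the sketch below is one plausible, purely illustrative reading of a loss that discourages grid-like seams by matching per-patch statistics of the harmonized output to the reference frame. The function name and the choice of patch-mean L1 matching are assumptions, not the authors' definition.

```python
import torch
import torch.nn.functional as F

def patch_balancing_loss(pred, target, patch_size=16):
    """Hypothetical sketch of a patch-level balancing term.

    pred, target: (B, C, H, W) harmonized output and reference frame.
    Penalizing differences in per-patch mean color discourages brightness
    jumps at patch boundaries, i.e. grid-like artifacts.
    """
    pred_mean = F.avg_pool2d(pred, patch_size, patch_size)    # (B, C, H/p, W/p)
    tgt_mean = F.avg_pool2d(target, patch_size, patch_size)
    return F.l1_loss(pred_mean, tgt_mean)

# Usage (illustrative shapes only): combined with a standard pixel
# reconstruction loss over the masked foreground tokens.
pred = torch.rand(2, 3, 224, 224)
target = torch.rand(2, 3, 224, 224)
loss = patch_balancing_loss(pred, target)
```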
Supplementary Material: zip
Submission Number: 57