Abstract: Portrait video editing has attracted wide attention thanks to its practical applications. Existing methods either target fixed-length clips or perform temporally inconsistent per-frame editing. In this work, we present a new system, StreamEdit, designed specifically to edit streaming videos. Our system follows the principle of edit propagation to ensure temporal consistency. Concretely, we edit only one reference frame and warp the outcome to obtain the editing results for all other frames. For this purpose, we employ a warping module, aided by a probabilistic pixel-correspondence estimation network, to establish the pixel-wise mapping between two frames. However, such a pipeline requires the reference frame to contain all content appearing in the video, which is rarely the case, especially in the presence of large motions and occlusions. To address this challenge, we propose to adaptively replace the reference frame, guided by a heuristic strategy based on the overall pixel-mapping uncertainty. The edits of the before- and after-replacement reference frames can then be aligned via image inpainting. Extensive experimental results demonstrate the effectiveness and generalizability of our approach in editing streaming portrait videos. Code will be made public.
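For intuition, a minimal sketch of the streaming loop described above is given below. All names here (`edit_image`, `estimate_correspondence`, `warp`, `inpaint_align`) are hypothetical stand-ins for the system's actual modules, and the fixed uncertainty threshold is an illustrative assumption rather than the paper's heuristic.

```python
def stream_edit(frames, edit_image, estimate_correspondence, warp,
                inpaint_align, uncertainty_threshold=0.5):
    """Propagate the edit of a single reference frame through a stream.

    `frames` is any iterator of video frames; the remaining arguments are
    hypothetical stand-ins for the system's modules (see the lead-in).
    """
    ref = next(frames)          # the first frame serves as the reference
    ref_edit = edit_image(ref)  # the only frame that is edited directly
    yield ref_edit
    for frame in frames:
        flow, uncertainty = estimate_correspondence(ref, frame)
        if uncertainty.mean() > uncertainty_threshold:
            # The mapping back to the reference is unreliable (large motion
            # or occlusion): promote the current frame to the new reference,
            # aligning its edit with the old one via inpainting.
            ref_edit = inpaint_align(ref_edit, ref, frame)
            ref = frame
            yield ref_edit
        else:
            yield warp(ref_edit, flow)
```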
Primary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: This work makes a significant contribution to multimedia/multimodal processing by addressing the challenging problem of editing streaming portrait videos in a temporally consistent manner. Temporal consistency is crucial for maintaining visual coherence and realism throughout a video sequence, yet existing methods either focus on fixed-length clips or perform per-frame editing, which leads to temporal inconsistencies.
The proposed system, StreamEdit, introduces a novel approach to achieving temporal consistency in streaming video editing. Following the principle of edit propagation, the system edits a single reference frame and warps the result to all other frames, so that the edited content integrates seamlessly with the rest of the video. To establish an accurate pixel-wise mapping between frames, a warping module aided by a probabilistic pixel-correspondence estimation network is employed.
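As a concrete illustration of how a probabilistic correspondence can drive the warping, the following PyTorch sketch computes an all-pairs matching distribution between reference and current-frame features, warps the edited reference through its soft-argmax, and derives a per-pixel uncertainty from the distribution's peakedness. This is one plausible instantiation under our own assumptions, not the paper's exact network.

```python
import torch
import torch.nn.functional as F

def probabilistic_warp(ref_edit, feat_ref, feat_cur):
    """Warp an edited reference frame to the current frame.

    ref_edit: (1, 3, H, W) edited reference image
    feat_ref, feat_cur: (1, C, H, W) frame features from any encoder
    Returns the warped edit and a per-pixel uncertainty map in [0, 1].
    """
    _, C, H, W = feat_ref.shape
    # All-pairs correlation: one matching distribution per current pixel.
    fr = feat_ref.flatten(2)                                  # (1, C, HW)
    fc = feat_cur.flatten(2)                                  # (1, C, HW)
    corr = torch.einsum('bci,bcj->bij', fc, fr) / C ** 0.5    # (1, HW, HW)
    prob = corr.softmax(dim=-1)       # probabilistic correspondence

    # Soft-argmax: expected reference coordinate of every current pixel.
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32),
                            indexing='ij')
    coords = torch.stack([xs.flatten(), ys.flatten()], dim=-1)  # (HW, 2)
    expected = prob @ coords                                    # (1, HW, 2)

    # Backward-warp the edit through the expected coordinates.
    gx = 2 * expected[..., 0] / (W - 1) - 1   # normalize to [-1, 1]
    gy = 2 * expected[..., 1] / (H - 1) - 1
    grid = torch.stack([gx, gy], dim=-1).view(1, H, W, 2)
    warped = F.grid_sample(ref_edit, grid, align_corners=True)

    # A flat distribution signals an ambiguous match (e.g., occlusion).
    uncertainty = 1.0 - prob.max(dim=-1).values.view(1, 1, H, W)
    return warped, uncertainty
```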
One of the key contributions of this work is the adaptive replacement of the reference frame. The authors propose a heuristic strategy based on the overall pixel-mapping uncertainty to decide when the reference frame should be replaced. This addresses the challenge posed by large motions and occlusions, which may prevent a single reference frame from containing all the content appearing in the video. The hand-over between reference frames is facilitated by image inpainting, allowing coherent editing across frames.
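A sketch of this hand-over step might look as follows, reusing the `probabilistic_warp` sketch above; the masking rule and threshold are illustrative assumptions, and `inpaint(image, mask)` stands in for any off-the-shelf image inpainting model.

```python
def replace_reference(old_ref_edit, feat_old_ref, feat_new_ref, inpaint,
                      mask_threshold=0.5):
    """Derive the edit of a newly chosen reference frame from the old one.

    Pixels the old reference can explain are carried over by warping;
    occluded or newly revealed pixels (high uncertainty) are filled by
    `inpaint(image, mask)` so the two edits stay visually aligned.
    """
    warped_edit, uncertainty = probabilistic_warp(
        old_ref_edit, feat_old_ref, feat_new_ref)
    hole_mask = (uncertainty > mask_threshold).float()  # 1 = unreliable pixel
    return inpaint(warped_edit, hole_mask)
```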
The effectiveness and generalizability of the proposed approach are demonstrated through extensive experiments. By addressing the specific problem of editing streaming portrait videos with temporal consistency, this work advances the field of multimedia/multimodal processing, offering new possibilities for real-time video editing applications. The planned release of the code further promotes reproducibility and facilitates future research and development in this area.
Supplementary Material: zip
Submission Number: 2200