Abstract: Video generation and editing, particularly human-centric video editing, have attracted growing interest for their potential to create immersive and dynamic content. A fundamental challenge is ensuring temporal coherence and visual harmony across frames, especially when handling large-scale human motion and maintaining consistency over long sequences. Previous methods, such as diffusion-based video editing, struggle with flickering and length limitations, while methods built on video-to-2D representations fail to accurately capture the complex structural relationships induced by large-scale human motion. Moreover, some patterns on the human body appear only intermittently throughout the video, making it difficult to establish visual correspondence. To address these problems, we present HeroMaker, a human-centric video editing framework that manipulates a person's appearance in the input video and produces inter-frame consistent results. Specifically, we learn motion priors, i.e., transformations from dual canonical fields to each video frame, by leveraging body mesh-based human motion warping and neural deformation-based margin refinement within a video reconstruction framework, which ensures the semantic correctness of the canonical fields. HeroMaker then performs human-centric video editing by manipulating the dual canonical fields and combining them with the learned motion priors to synthesize temporally coherent and visually plausible results. Comprehensive experiments demonstrate that our approach surpasses existing methods in temporal consistency, visual quality, and semantic coherence.
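To make the pipeline described above concrete, the following is a minimal sketch (assuming PyTorch) of how a single edit to a canonical-field image could be propagated to every frame through per-frame warp grids. All names, the single-image canonical field, the dense warp-grid form of the motion prior, and the identity-grid example are illustrative assumptions; they do not reproduce HeroMaker's actual dual canonical fields, mesh-based warping, or margin-refinement modules.

```python
# Hypothetical sketch: edit a canonical image once, then warp it to each frame
# using per-frame motion priors represented as dense backward sampling grids.
import torch
import torch.nn.functional as F

def warp_canonical(canonical, warp_grid, refine_offset=None):
    """Warp a canonical image into a frame using a per-frame motion prior.

    canonical:     (1, 3, H, W) canonical-field appearance.
    warp_grid:     (1, H, W, 2) normalized sampling grid from the motion prior.
    refine_offset: optional (1, H, W, 2) residual deformation (stand-in for
                   neural margin refinement).
    """
    grid = warp_grid if refine_offset is None else warp_grid + refine_offset
    return F.grid_sample(canonical, grid, mode="bilinear", align_corners=True)

def edit_video(canonical, warp_grids, edit_fn):
    """Edit the canonical field once, then propagate the edit to all frames."""
    edited_canonical = edit_fn(canonical)  # e.g. a text-guided image edit
    return [warp_canonical(edited_canonical, g) for g in warp_grids]

if __name__ == "__main__":
    H, W, T = 64, 64, 4
    canonical = torch.rand(1, 3, H, W)
    # Identity grids plus small noise stand in for learned per-frame priors.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W),
                            indexing="ij")
    identity = torch.stack([xs, ys], dim=-1).unsqueeze(0)
    grids = [identity + 0.01 * torch.randn_like(identity) for _ in range(T)]
    frames = edit_video(canonical, grids, edit_fn=lambda c: c.clamp(0, 1))
    print(len(frames), frames[0].shape)  # 4 torch.Size([1, 3, 64, 64])
```

The key property this sketch illustrates is that the edit is applied once in canonical space and reused across frames, which is what yields inter-frame consistency.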
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: This work contributes to multimedia/multimodal processing by addressing key challenges in human-centric video editing, particularly ensuring temporal coherence and visual harmony across frames. It compresses the video modality into the image modality using 3D motion priors, and then leverages existing text-guided foundation diffusion models to edit the resulting images, thereby achieving video editing. By using information from different modalities to transform a more complex task into a simpler one, the work reflects the complementarity and interchangeability of information across modalities. It also explores how existing multimodal foundation models can be leveraged for downstream tasks.
Supplementary Material: zip
Submission Number: 2755