SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Audio-driven talking face generation aims to synthesize video with lip movements synchronized to input audio. However, current generative techniques face challenges in preserving intricate regional textures (skin, teeth). To address these challenges, we propose a novel framework called \textbf{SegTalker} that decouples lip movements from image textures by introducing segmentation as an intermediate representation. Specifically, given the mask of an image obtained from a parsing network, we first leverage the speech to drive the mask and generate a talking segmentation. Then we disentangle the semantic regions of the image into style codes using a mask-guided encoder. Finally, we inject the previously generated talking segmentation and style codes into a mask-guided StyleGAN to synthesize video frames. In this way, most of the textures are fully preserved. Moreover, our approach inherently achieves background separation and facilitates mask-guided local facial editing. In particular, by editing the mask and swapping the region textures from a given reference image (e.g. hair, lips, eyebrows), our approach enables seamless facial editing when generating talking face videos. Experiments demonstrate that our proposed approach effectively preserves texture details and generates temporally consistent video while remaining competitive in lip synchronization. Quantitative results on the HDTF dataset illustrate the superior performance of our method over existing methods on most metrics.
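The abstract describes a three-stage data flow: a parsing network yields a mask, speech drives the mask into a talking segmentation, a mask-guided encoder extracts per-region style codes, and a mask-guided StyleGAN synthesizes the frame. Below is a minimal, hypothetical PyTorch sketch of that flow only; every module body, name, and dimension (N_REGIONS, STYLE_DIM, AUDIO_DIM, IMG_SIZE) is a placeholder assumption, not the paper's implementation, which uses a face-parsing network and a StyleGAN-based generator.

```python
# Hypothetical sketch of the pipeline described in the abstract.
# All internals are placeholder assumptions standing in for the real
# face-parsing, speech-driven mask, style-encoder, and StyleGAN components.
import torch
import torch.nn as nn

N_REGIONS = 19   # assumed number of face-parsing classes
STYLE_DIM = 512  # assumed per-region style-code dimension
AUDIO_DIM = 80   # assumed audio feature dimension (e.g. mel bins)
IMG_SIZE = 64    # small placeholder resolution so the sketch runs quickly


class TalkingSegGenerator(nn.Module):
    """Placeholder: drives the parsing mask with speech to get a talking segmentation."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.Conv2d(N_REGIONS + 1, N_REGIONS, kernel_size=3, padding=1)

    def forward(self, mask, audio_feat):
        # Broadcast a pooled audio feature over the spatial grid, then fuse with the mask.
        a = audio_feat.mean(dim=-1, keepdim=True)                  # (B, 1)
        a = a[:, :, None, None].expand(-1, 1, IMG_SIZE, IMG_SIZE)  # (B, 1, H, W)
        return self.fuse(torch.cat([mask, a], dim=1)).softmax(dim=1)


class MaskGuidedEncoder(nn.Module):
    """Placeholder: pools image features inside each semantic region into a style code."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(3, STYLE_DIM)

    def forward(self, image, mask):
        codes = []
        for r in range(N_REGIONS):
            w = mask[:, r:r + 1]                                   # (B, 1, H, W)
            pooled = (image * w).sum(dim=(2, 3)) / (w.sum(dim=(2, 3)) + 1e-6)
            codes.append(self.proj(pooled))
        return torch.stack(codes, dim=1)                           # (B, R, STYLE_DIM)


class MaskGuidedGenerator(nn.Module):
    """Placeholder standing in for the mask-guided StyleGAN synthesis network."""
    def __init__(self):
        super().__init__()
        self.to_rgb = nn.Conv2d(N_REGIONS, 3, kernel_size=1)

    def forward(self, talking_seg, style_codes):
        # Modulate each region channel by a scalar derived from its style code.
        gain = style_codes.mean(dim=-1)[:, :, None, None]          # (B, R, 1, 1)
        return torch.tanh(self.to_rgb(talking_seg * gain))


if __name__ == "__main__":
    B = 1
    image = torch.rand(B, 3, IMG_SIZE, IMG_SIZE)                            # source frame
    mask = torch.rand(B, N_REGIONS, IMG_SIZE, IMG_SIZE).softmax(dim=1)      # parsing mask
    audio = torch.rand(B, AUDIO_DIM)                                        # speech features

    talking_seg = TalkingSegGenerator()(mask, audio)   # speech-driven segmentation
    styles = MaskGuidedEncoder()(image, mask)          # per-region style codes
    frame = MaskGuidedGenerator()(talking_seg, styles) # synthesized frame
    print(frame.shape)                                 # torch.Size([1, 3, 64, 64])
```

The sketch is only meant to make the decoupling explicit: lip motion enters through the segmentation branch, while region textures enter through the style codes, so editing the mask or swapping a region's style code changes one without disturbing the other.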
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Experience] Multimedia Applications, [Content] Multimodal Fusion
Relevance To Conference: Audio-driven talking face generation contributes significantly to multimedia and multimodal processing by enhancing the realism of multimedia content and extending the reach of multimedia content creation. In multimedia applications such as digital humans, virtual conferencing, and video dubbing, audio-driven talking face generation enhances communication by providing visual content synchronized with the spoken words. Moreover, it facilitates the creation of personalized avatars, virtual assistants, and interactive characters, enriching user interactions in various multimedia environments. From the perspective of multimodal processing, audio-driven talking face generation integrates audio and visual modalities seamlessly to generate coherent and synchronized content. This fusion of modalities not only enhances the naturalness of synthesized content but also enables new applications in multimodal analysis, synthesis, and understanding. Unlike existing talking face generation techniques, our framework SegTalker effectively preserves intricate regional textures such as skin and teeth while synchronizing lip movements to the input audio. By leveraging segmentation as an intermediate representation, it decouples lip movements from image textures. Overall, audio-driven talking face generation advances multimedia and multimodal processing by bridging the gap between audio and visual modalities, leading to more immersive and interactive multimedia experiences.
Supplementary Material: zip
Submission Number: 2594