Audio-Driven Identity Manipulation for Face Inpainting

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: Recent advances in multimodal artificial intelligence have greatly improved the integration of vision, language, and audio cues to enrich the content creation process. Inspired by these developments, in this paper we integrate audio into the face inpainting task for the first time to facilitate identity manipulation. Our main insight is that a person's voice carries distinct identity markers, such as age and gender, which provide an essential supplement for identity-aware face inpainting. By extracting identity information from audio as guidance, our method naturally supports both identity preservation and identity swapping in face inpainting. Specifically, we introduce a dual-stream network architecture comprising a face branch and an audio branch. The face branch extracts deterministic information from the visible parts of the input masked face, while the audio branch captures heuristic identity priors from the speaker's voice. The identity codes from the two streams are integrated by a multi-layer perceptron (MLP) into a virtual unified identity embedding that represents comprehensive identity features. In addition, to explicitly exploit the information from audio, we introduce an audio-face generator that produces a 'fake' audio face directly from audio, and we fuse the multi-scale intermediate features from the audio-face generator into the face inpainting network through an audio-visual feature fusion (AVFF) module. Extensive experiments demonstrate the positive impact of extracting identity information from audio on the face inpainting task, especially for identity preservation.
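
To make the dual-stream design concrete, the sketch below illustrates one plausible PyTorch realization of the two fusion points the abstract describes: an MLP that merges the face and audio identity codes into the unified identity embedding, and a gated block standing in for the AVFF module that injects audio-face generator features into the inpainting stream at one scale. The abstract does not specify implementation details, so all module names, feature dimensions, and the gating formulation here are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn


class IdentityFusionMLP(nn.Module):
    """Fuse face-branch and audio-branch identity codes into one
    unified identity embedding (dimensions are assumed)."""

    def __init__(self, face_dim=512, audio_dim=512, embed_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(face_dim + audio_dim, embed_dim),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, face_code, audio_code):
        # Concatenate the two identity codes and project them into a
        # single embedding space.
        return self.mlp(torch.cat([face_code, audio_code], dim=-1))


class AVFFBlock(nn.Module):
    """Hypothetical audio-visual feature fusion block: features from the
    audio-face generator modulate the inpainting features at one scale
    through a learned spatial gate (the gating scheme is our guess)."""

    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels * 2, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        self.proj = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, inpaint_feat, audio_face_feat):
        # The gate decides, per spatial location, how much audio-derived
        # structure to inject into the visual inpainting stream.
        g = self.gate(torch.cat([inpaint_feat, audio_face_feat], dim=1))
        return inpaint_feat + g * self.proj(audio_face_feat)


# Toy usage with random tensors standing in for real branch outputs.
fusion = IdentityFusionMLP()
id_embed = fusion(torch.randn(2, 512), torch.randn(2, 512))  # (2, 512)

avff = AVFFBlock(channels=64)
fused = avff(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
print(id_embed.shape, fused.shape)
```

In a multi-scale setup, one such fusion block would be applied at each resolution level of the inpainting decoder, matching the abstract's mention of fusing multi-scale intermediate features from the audio-face generator.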
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: This work integrates audio with face inpainting for the first time, opening a new direction in multimodal fusion. By extracting identity information from audio, it improves the reconstruction of heavily occluded or extensively damaged facial images, which in turn can benefit face detection and recognition accuracy. This synergy leverages the consistency of identity traits across sensory modalities: voice and facial features complement each other in characterizing a person's identity. The technique improves inpainting quality, strengthens face recognition in security systems, enables cross-modal identity verification, and opens up applications in multimedia editing, virtual reality, and human-computer interaction. It not only extends research in multimodal data processing but also paves the way for more accurate identity verification and facial reconstruction, holding substantial research value and broad application prospects.
Supplementary Material: zip
Submission Number: 2039