Abstract: Recent advances in multimodal artificial intelligence have greatly improved the integration of vision-language-audio cues to enrich the content creation process. Inspired by these developments, in this paper, we first integrate audio into the face inpainting task to facilitate identity manipulation. Our main insight is that a person's voice carries distinct identity markers, such as age and gender, which provide an essential supplement for identity-aware face inpainting. By extracting identity information from audio as guidance, our method can naturally support identity preservation and identity swapping in face inpainting. Specifically, we introduce a dual-stream network architecture comprising a face branch and an audio branch. The face branch extracts deterministic information from the visible parts of the input masked face, while the audio branch captures heuristic identity priors from the speaker's voice. The identity codes from the two streams are integrated using a multi-layer perceptron (MLP) to create a virtual unified identity embedding that represents comprehensive identity features. In addition, to explicitly exploit the information from audio, we introduce an audio-face generator that synthesizes a 'fake' audio face directly from audio, and we fuse the multi-scale intermediate features from this generator into the face inpainting network through an audio-visual feature fusion (AVFF) module. Extensive experiments demonstrate the positive impact of extracting identity information from audio on the face inpainting task, especially for identity preservation.
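The dual-stream identity fusion described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the module name `IdentityFusion`, the 512-dimensional codes, and the two-layer MLP are all assumptions; the actual face and audio encoders are not specified in the abstract.

```python
import torch
import torch.nn as nn

class IdentityFusion(nn.Module):
    """Sketch: merge a face identity code and an audio identity prior
    into one unified identity embedding via an MLP (dims are assumed)."""

    def __init__(self, face_dim=512, audio_dim=512, id_dim=512):
        super().__init__()
        # MLP that maps the concatenated codes to a unified embedding
        self.mlp = nn.Sequential(
            nn.Linear(face_dim + audio_dim, id_dim),
            nn.ReLU(inplace=True),
            nn.Linear(id_dim, id_dim),
        )

    def forward(self, face_code, audio_code):
        # face_code: deterministic code from the visible face regions
        # audio_code: heuristic identity prior from the speaker's voice
        return self.mlp(torch.cat([face_code, audio_code], dim=-1))

fusion = IdentityFusion()
face_code = torch.randn(4, 512)   # batch of face identity codes
audio_code = torch.randn(4, 512)  # batch of audio identity codes
unified = fusion(face_code, audio_code)
print(unified.shape)  # torch.Size([4, 512])
```

In the full method, this unified embedding would condition the inpainting generator, while the audio-face generator's multi-scale features are injected separately through the AVFF module.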