SingAvatar: High-fidelity Audio-driven Singing Avatar Synthesis

Published: 01 Jan 2024 · Last Modified: 11 Nov 2024 · ICME 2024 · License: CC BY-SA 4.0
Abstract: Generating photo-realistic avatars from audio plays an important role in extended reality (XR) and metaverse. In this paper, we lift the input audio from speech to singing, which has been rarely studied. The significant distinction between singing and talking poses great challenges for adapting talking face generation methods to the singing regime. To address this, we propose a high-fidelity singing avatar synthesis method called SingAvatar. Besides the audio, we incorporate vocal conditions involving phonemes and variance to alleviate the ambiguity of learning the singing-to-face mapping. Concretely, we tailor a two-stage pipeline: singing voice synthesis and portrait generation from the synthesized audio and auxiliary vocal conditions. Further, we curate a fine-grained singing head dataset containing singing videos with synchronized audio and accurate vocal conditions. In experiments, SingAvatar outperforms competing methods regarding audio-mouth synchronization, the naturalness of head movements, and controllability over the results. The code and dataset will be made publicly available.
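The abstract describes a two-stage pipeline in which a singing voice synthesis stage produces audio together with auxiliary vocal conditions (phonemes and variance features), and a portrait generation stage renders the avatar from those signals. The sketch below illustrates that dataflow only; every class name, function signature, tensor shape, and the frame-level conditioning granularity is a hypothetical placeholder, not the authors' implementation.

```python
# Minimal sketch of a two-stage singing-avatar pipeline as outlined in the abstract.
# All names and shapes are assumptions for illustration, not the SingAvatar code.
import numpy as np

class SingingVoiceSynthesizer:
    """Stage 1 (assumed interface): synthesize singing audio plus auxiliary
    vocal conditions, here a phoneme sequence and variance features (e.g. pitch, energy)."""
    def __call__(self, score: dict) -> dict:
        n_frames = score["n_frames"]                      # frame count derived from the score (assumed)
        return {
            "audio": np.zeros(n_frames * 256),            # placeholder waveform samples
            "phonemes": np.zeros(n_frames, dtype=int),    # frame-level phoneme IDs (assumed granularity)
            "variance": np.zeros((n_frames, 2)),          # e.g. pitch and energy per frame
        }

class PortraitGenerator:
    """Stage 2 (assumed interface): render portrait frames conditioned on the
    synthesized audio and the vocal conditions that disambiguate the singing-to-face mapping."""
    def __call__(self, audio, phonemes, variance) -> np.ndarray:
        n_frames = len(phonemes)
        return np.zeros((n_frames, 256, 256, 3), dtype=np.uint8)  # placeholder video frames

def sing_avatar(score: dict) -> np.ndarray:
    """End-to-end flow: score -> (audio, vocal conditions) -> portrait video."""
    svs, renderer = SingingVoiceSynthesizer(), PortraitGenerator()
    cond = svs(score)
    return renderer(cond["audio"], cond["phonemes"], cond["variance"])

frames = sing_avatar({"n_frames": 100})
print(frames.shape)  # (100, 256, 256, 3)
```

The point of the sketch is the interface between the stages: the portrait generator consumes not only audio but also the phoneme and variance conditions, which is how the abstract says the ambiguity of the singing-to-face mapping is reduced.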
