Music-driven Character Dance Video Generation based on Pre-trained Diffusion Model

Published: 01 Jan 2024, Last Modified: 14 Nov 2024 · IJCNN 2024 · CC BY-SA 4.0
Abstract: Large-scale pre-trained models have made significant progress in cross-modal generation tasks, especially text-to-image generation. However, pre-trained models for audio-guided video generation remain rare. ControlNet [1] provides an architecture for augmenting pre-trained diffusion models with task-specific conditions. Following the ControlNet [1] architecture, we propose a music-driven character dance video generation model built on a pre-trained diffusion model, taking a text prompt, music, and a character image as additional guidance conditions. The model exploits multimodal semantic correspondence between text, music, and video by incorporating the pre-trained CLIP [2] and Wav2CLIP [3] models to better generate character dance videos. Additionally, we design a text prompt to improve the appearance quality of the generated characters. Extensive experiments on the AIST++ dataset demonstrate the effectiveness of our method in generating character dance videos.
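The abstract does not spell out how the CLIP text embedding and the Wav2CLIP music embedding are combined into a conditioning signal, so the following is only a minimal sketch of the general idea: because Wav2CLIP is trained to embed audio into the same joint space as CLIP, a text prompt and a music clip can be encoded separately and fused into a single vector that conditions a ControlNet-style branch. The fusion weight `alpha`, the frame averaging, and the projection into a 768-dimensional cross-attention context are illustrative assumptions, not the authors' implementation.

```python
# Sketch (not the authors' code): encode text with CLIP and music with Wav2CLIP,
# then fuse them into one conditioning vector for a ControlNet-style branch.
import numpy as np
import torch
import torch.nn as nn
import clip        # https://github.com/openai/CLIP
import wav2clip    # https://github.com/descriptinc/lyrebird-wav2clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Text prompt -> CLIP text embedding (512-d for ViT-B/32).
clip_model, _ = clip.load("ViT-B/32", device=device)
tokens = clip.tokenize(["a person dancing, full body, studio background"]).to(device)
with torch.no_grad():
    text_emb = clip_model.encode_text(tokens).float()            # (1, 512)

# 2) Music waveform -> Wav2CLIP audio embedding in the shared CLIP space.
audio = np.random.randn(16000 * 5).astype(np.float32)            # placeholder: 5 s @ 16 kHz
w2c_model = wav2clip.get_model()
music_emb = torch.from_numpy(wav2clip.embed_audio(audio, w2c_model)).to(device).float()
music_emb = music_emb.mean(dim=0, keepdim=True)                  # average if frame-level output

# 3) Fuse modalities; a fixed weighted sum is an assumption made for illustration.
alpha = 0.5
cond = alpha * text_emb + (1 - alpha) * music_emb                # (1, 512)
cond = cond / cond.norm(dim=-1, keepdim=True)

# 4) Hypothetical projection to the cross-attention context size of a
#    Stable-Diffusion-style U-Net (768-d), as a ControlNet branch would consume.
to_context = nn.Linear(512, 768).to(device)
context = to_context(cond).unsqueeze(1)                          # (1, 1, 768)
print(context.shape)
```

Under these assumptions, `context` would be passed to the trainable ControlNet copy of the U-Net while the base diffusion model stays frozen; how the paper actually injects the music signal (per-frame vs. per-clip, and the exact fusion) is not specified in the abstract.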