Dance Video Generation Using a Music-to-Pose Encoder Trained with a Synthetic Dataset Generation Pipeline within a Latent Diffusion Framework

Published: 07 Aug 2025, Last Modified: 20 Aug 2025 · Gen4AVC Poster · CC BY 4.0
Keywords: Dance Video Generation, Music-to-Pose Encoder, Synthetic Dataset Generation
Abstract: We present a framework for generating human dance videos conditioned on music and a reference image. A Music-to-Pose Encoder (M2PEnc), trained with a Synthetic Dataset Generation Pipeline (SDGPip), maps musical features into structured 3D pose parameters, ensuring precise rhythm–motion alignment. These pose sequences condition a latent diffusion model (LDM) with multi-level attention to synthesize motions that are rhythmically synchronized, visually coherent, and faithful to the reference subject. Extensive benchmark evaluations demonstrate state-of-the-art performance and strong generalization across subjects, styles, and genres. Comprehensive ablation studies confirm the contributions of each component, and a user study verifies the naturalness and expressiveness of the generated dances. Together, these results underscore the robustness and effectiveness of the proposed approach.
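To make the pipeline described above concrete, the following is a minimal sketch of what a Music-to-Pose Encoder conditioning stage could look like. It is not the authors' implementation: the module names, feature dimensions, transformer backbone, and the 72-D axis-angle pose parameterization are all illustrative assumptions; the actual M2PEnc architecture and the way its output conditions the latent diffusion model are described in the paper itself.

```python
# Hypothetical sketch of a Music-to-Pose Encoder (M2PEnc-style) stage.
# Assumptions: per-frame music features of dimension 438, SMPL-style 72-D
# axis-angle pose parameters per frame, and a transformer temporal encoder.
import torch
import torch.nn as nn


class MusicToPoseEncoder(nn.Module):
    """Maps a sequence of music features to a sequence of 3D pose parameters."""

    def __init__(self, music_dim=438, pose_dim=72, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.in_proj = nn.Linear(music_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(d_model, pose_dim)

    def forward(self, music_feats):
        # music_feats: (batch, frames, music_dim) -> poses: (batch, frames, pose_dim)
        x = self.in_proj(music_feats)
        x = self.temporal_encoder(x)
        return self.out_proj(x)


if __name__ == "__main__":
    m2p = MusicToPoseEncoder()
    music = torch.randn(2, 120, 438)   # 2 clips, 120 frames of music features
    poses = m2p(music)                 # (2, 120, 72) pose parameters
    print(poses.shape)
    # In the full pipeline, these pose sequences (together with a reference
    # image) would serve as conditioning inputs to the latent diffusion
    # model's multi-level attention layers.
```

In this sketch the pose sequence plays the role of the structured 3D conditioning signal; how it is injected into the diffusion model (e.g., through cross-attention at multiple levels) follows the paper's design rather than this illustration.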
Submission Number: 5