Dance Video Generation Using a Music-to-Pose Encoder Trained with a Synthetic Dataset Generation Pipeline and a Latent Diffusion Framework
Keywords: Dance Video Generation, Music-to-Pose Encoder, Synthetic Dataset Generation
Abstract: We present a framework for generating human dance videos conditioned on music and a reference image. Our approach introduces a Music-to-Pose Encoder (M2PEnc), trained with a Synthetic Dataset Generation Pipeline (SDGPip), that maps audio features to structured 3D pose parameters, which in turn condition a latent diffusion model (LDM). By leveraging multi-level attention mechanisms in the LDM, the framework produces rhythmically synchronized and visually coherent dance animations. It achieves state-of-the-art performance on benchmark datasets and generalizes robustly to diverse individuals and dance genres.
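The abstract describes M2PEnc only at a high level. As a minimal sketch of what such an audio-to-pose mapping could look like, assuming a transformer over per-frame audio features and SMPL-style 72-dimensional pose parameters (the class name, dimensions, and layer choices below are illustrative assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn

class MusicToPoseEncoder(nn.Module):
    """Hypothetical music-to-pose encoder: per-frame audio features in,
    3D pose parameters out. All dimensions and layers are illustrative;
    the abstract does not specify the M2PEnc architecture."""

    def __init__(self, audio_dim=128, hidden_dim=256, pose_dim=72, num_layers=4):
        super().__init__()
        self.proj = nn.Linear(audio_dim, hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=8, batch_first=True
        )
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(hidden_dim, pose_dim)

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, audio_dim), e.g. mel-spectrogram slices
        h = self.proj(audio_feats)
        h = self.temporal(h)   # model temporal structure across frames
        return self.head(h)    # (batch, frames, pose_dim) pose parameters

# Hypothetical usage: the predicted pose sequence would then serve as the
# conditioning signal for the latent diffusion model.
encoder = MusicToPoseEncoder()
audio = torch.randn(2, 64, 128)  # 2 clips, 64 frames of audio features
poses = encoder(audio)           # (2, 64, 72) pose parameters
print(poses.shape)
```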
Submission Number: 5