STM2PE-Diff : Synthetically Trained Music-to-Pose Encoder Diffusion for Automated Choreography Generation
Keywords: music-to-dance motion, synthetic data generation, application to music-to-dance generation
Abstract: Automated choreography generation, which aims to seamlessly harmonize human movements with music, is a multifaceted challenge demanding both technical precision and artistic expressiveness.
We present STM2PE-Diff, a novel framework for generating human dance videos conditioned on a reference image and a music sequence using a latent diffusion model. Our approach integrates a Music-to-Pose Encoder (M2PEnc), trained with a novel synthetic dataset generation pipeline (SDGPip), which maps audio features into structured 3D pose and shape parameters that capture human geometry and dynamic motion patterns synchronized with the musical input. By combining these encoded parameters with a reference image through a multi-level attention mechanism within the latent diffusion framework, we synthesize visually coherent, rhythmically synchronized dance animations of the individual depicted in the reference image.
Experiments on benchmark datasets demonstrate that STM2PE-Diff achieves state-of-the-art performance, producing high-quality dance videos that exhibit both pose diversity and temporal consistency. Additionally, our method generalizes robustly, as validated by its strong performance on a newly introduced in-the-wild dataset.
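To make the described mapping from audio features to pose and shape parameters concrete, the following is a minimal, hypothetical PyTorch sketch of a Music-to-Pose Encoder. The module names, dimensions (e.g., a 72-d axis-angle pose and 10-d shape vector in the SMPL style), and the transformer backbone are assumptions for illustration only, not the authors' implementation.

```python
# Hypothetical sketch of a Music-to-Pose Encoder (M2PEnc-style module).
# All names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class MusicToPoseEncoder(nn.Module):
    """Maps a sequence of audio features to per-frame 3D pose and shape parameters."""

    def __init__(self, audio_dim=128, hidden_dim=256, pose_dim=72, shape_dim=10,
                 num_layers=4, num_heads=8):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)       # project audio features
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        self.pose_head = nn.Linear(hidden_dim, pose_dim)         # per-frame body pose (axis-angle)
        self.shape_head = nn.Linear(hidden_dim, shape_dim)       # per-sequence body shape

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, audio_dim), e.g. mel-spectrogram slices
        h = self.temporal_encoder(self.audio_proj(audio_feats))  # (batch, frames, hidden_dim)
        pose = self.pose_head(h)                                 # (batch, frames, pose_dim)
        shape = self.shape_head(h.mean(dim=1))                   # (batch, shape_dim)
        return pose, shape


if __name__ == "__main__":
    m2penc = MusicToPoseEncoder()
    dummy_audio = torch.randn(2, 150, 128)    # 2 clips, 150 frames of audio features
    pose, shape = m2penc(dummy_audio)
    print(pose.shape, shape.shape)            # (2, 150, 72) and (2, 10)
```

In a pipeline of the kind the abstract describes, the predicted pose and shape sequences would then be encoded alongside the reference image and injected into the latent diffusion model's attention layers as conditioning signals; the details of that conditioning are specific to the paper and are not reproduced here.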
Submission Number: 70