TeMuDance: Zero-Shot Textual Control for Music-Driven Dance Generation

ACL ARR 2026 January Submission6805 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: music-conditioned dance generation, text-guided controllable generation, diffusion models, cross-modal alignment

Abstract: Existing music-driven dance generation approaches demonstrate strong realism and effective alignment between audio and motion. However, they generally lack semantic controllability, making it difficult to guide specific movements through natural language descriptions. This limitation stems primarily from the absence of large-scale datasets that jointly align music, text, and motion, which prevents direct supervised learning of text-conditioned control. To address this challenge, we propose TeMuDance, a framework that enables zero-shot text-based control for music-conditioned dance generation. TeMuDance establishes a motion-centered bridging paradigm that aligns separate music-dance and text-motion datasets within a shared embedding space. Using motion as a pivot, we synthesize pseudo-triplets by retrieving the missing modality for each corpus: text for music-dance pairs and music for text-motion pairs. Exploiting these synthesized priors, we train a text control branch that integrates semantic guidance into a frozen pretrained dance generation backbone, improving instruction compliance while preserving rhythmic consistency and motion realism. In addition, we introduce a motion-centered dual-stream fine-tuning strategy that jointly augments the two corpora and stabilises training in the presence of noisy pseudo annotations. Extensive experiments demonstrate that TeMuDance achieves competitive dance quality while substantially improving text-conditioned control over existing methods.
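The motion-pivot retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that motion clips from both corpora have already been embedded into the shared space, and it uses plain cosine-similarity nearest-neighbour lookup. The function names (`cosine`, `complete_with_nearest`), the toy embeddings, and the choice of cosine similarity are illustrative assumptions, not details taken from the paper, which may use a different encoder, similarity measure, or filtering step.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def complete_with_nearest(query_motion_embs, pool_motion_embs, pool_items):
    """Hypothetical helper: for each query motion embedding, find the most
    similar motion in the other corpus and borrow the item paired with it
    (a caption, or a music clip id). Returns (item, similarity) pairs."""
    completed = []
    for q in query_motion_embs:
        scores = [cosine(q, p) for p in pool_motion_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        completed.append((pool_items[best], scores[best]))
    return completed


# Toy data (assumed): two dance clips from the music-dance corpus and three
# motions from the text-motion corpus, all in the same 2-D embedding space.
dance_motion_embs = [[1.0, 0.0], [0.0, 1.0]]
text_motion_embs = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]]
captions = ["spin", "jump", "bow"]

pseudo_text = complete_with_nearest(dance_motion_embs, text_motion_embs, captions)
print([caption for caption, _ in pseudo_text])
```

Running the same helper in the other direction, with text-motion embeddings as queries and music clips as the borrowed items, would give the music side of the pseudo-triplets. The returned similarity could serve as a confidence score for down-weighting noisy pseudo annotations, although whether the paper does so is not stated in the abstract.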

Paper Type: Long

Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond

Research Area Keywords: multimodality, cross-modal content generation, generative models

Languages Studied: English

Submission Number: 6805
