TMD-Bench: A Multi-Level Evaluation Paradigm for Music–Dance Co-Generation

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, Venue: ICML 2026 regular, License: CC BY 4.0
TL;DR: This paper introduces TMD-Bench, a benchmark for text-driven music–dance co-generation.
Abstract: Unified audio–visual generation is rapidly gaining industrial and creative relevance, enabling applications in virtual production and interactive media. However, when moving from general audio–video synthesis to music–dance co-generation, the task becomes substantially harder: musical rhythm, phrasing, and accents must drive choreographic motion at fine temporal resolution, and such rhythmic coupling is not captured by unimodal metrics or generic audiovisual consistency scores used in current evaluation practice. We introduce TMD-Bench, a benchmark for text-driven music–dance co-generation that assesses systems across unimodal generation quality, instruction adherence, and cross-modal rhythmic alignment. The benchmark integrates computable physical metrics with perceptual multimodal judgments, and is supported by a curated rhythm-aligned music–dance dataset and a fine-grained Music Captioner for structured music semantics. TMD-Bench further reveals that (i) modern commercial audio–visual models (e.g., Veo 3, Sora 2) produce high-quality music and video, while rhythmic coupling remains less consistently optimized and leaves room for improvement, and (ii) our unified baseline RhyJAM, trained on rhythm-aligned data, achieves competitive beat-level synchronization while maintaining competitive unimodal fidelity. These findings point toward next-generation music–dance models that explicitly optimize rhythmic and kinetic coherence. The code is available at https://github.com/MM-Speech/TMD-Bench.
Lay Summary: Generating a video of someone dancing with matching music is harder than simply making good music and a good video separately. A convincing dance performance needs the body movements to follow the rhythm, beats, and expressive changes in the music. Current evaluation methods for audio-video generation often check whether the sound and image generally match, but they do not carefully measure whether the dance is truly synchronized with the music. This paper introduces TMD-Bench, a benchmark for evaluating text-driven music-and-dance generation. It looks at three main aspects: whether the generated music and video are high quality, whether they follow the text prompt, and whether the dance movements match the rhythm of the music. To make this possible, the paper combines automatic measurements, such as beat and motion timing, with judgment from multimodal models that approximate human perception. The paper also builds a curated dataset of rhythm-aligned music-dance examples and introduces RhyJAM, a baseline model that generates music and dance together. Results show that even strong commercial systems can make impressive music and video, but still have room to improve in beat-level dance-music synchronization.
Link To Code: https://github.com/MM-Speech/TMD-Bench
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: text-to-audio and video, av-alignment
Submission Number: 13945