DanceTogether: Generating Interactive Multi-Person Video without Identity Drifting

ICLR 2026 Conference Submission1215 Authors

03 Sept 2025 (modified: 29 Nov 2025)ICLR 2026 Conference SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Controllable video generation, Multi-person Interactive Video Generation, Multi-person Pose Estimation, Multi-object Tracking
TL;DR: DanceTogether is an end-to-end diffusion framework for controllable multi-actor video generation with identity preservation, enabled by a novel adapter, new datasets, and a benchmark supporting human–robot and cross-domain scenarios.
Abstract: Controllable video generation (CVG) has advanced rapidly, yet current systems falter when more than one actor must move, interact, and exchange positions under noisy control signals. We address this gap with DanceTogether, the first end-to-end diffusion framework that turns a single reference image plus independent pose-mask streams into long, photorealistic videos while strictly preserving every identity. A novel MaskPoseAdapter binds “who” and “how” at every denoising step by fusing robust tracking masks with semantically rich but noisy pose heat maps, eliminating the identity drift and appearance bleeding that plague frame-wise pipelines. To train and evaluate at scale, we introduce (i) PairFS-4K, 26 h of dual-skater footage with more than 7 000 distinct IDs, (ii) HumanRob-300, a one-hour humanoid-robot interaction set for rapid cross-domain transfer, and (iii) TogetherVideoBench, a three-track benchmark centred on the DanceTogEval-100 test suite covering dance, boxing, wrestling, yoga, and figure skating. On TogetherVideoBench, DanceTogether outperforms the prior arts by a significant margin. Moreover, we show that a one-hour fine-tune yields convincing human-robot videos, underscoring broad generalisation to embodied-AI and HRI tasks. Extensive ablations confirm that persistent identity-action binding is critical to these gains. Together, our model, datasets, and benchmark lift CVG from single-subject choreography to compositionally controllable, multi-actor interaction, opening new avenues for digital production, simulation, and embodied intelligence.
Supplementary Material: zip
Primary Area: generative models
Submission Number: 1215
Loading