Dress&Dance: Dress up and Dance as You Like It

15 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Video Diffusion Model, Virtual Try-On, Generative Model
TL;DR: We generate high-quality virtual try-on videos at high resolution (1152×720) and high frame rate (24 FPS).
Abstract: We present Dress&Dance, a video diffusion framework that generates high-quality virtual try-on videos of users wearing desired garments while performing complex motions. Our approach is the first to achieve high resolution (1152×720) at a high frame rate (24 FPS), with support for various try-on modes, including simultaneous try-on of tops and bottoms. At the core of our framework is CondNets, a novel attention-based conditioning architecture that unifies heterogeneous multi-modal inputs – including garment images, user images, motion videos, and text prompts – into a single homogeneous token sequence. To prevent strong pre-trained text priors from overshadowing garment inputs, we introduce garment-aware target steering, a guidance mechanism that enforces accurate garment placement. To further address both the data scarcity and the computational demands of training high-quality video models, we propose a synthetic triplet generation strategy for producing paired training data and a multi-stage training curriculum that progressively scales resolution and frame rate. Our framework outperforms existing open-source and commercial solutions, enabling flexible, high-quality try-on experiences that faithfully preserve garment details, user identity, and complex motions.
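The abstract describes CondNets as projecting heterogeneous conditions into a single homogeneous token sequence for the diffusion backbone to attend over. Below is a minimal sketch of that general idea, not the authors' implementation; all module names, encoders, and dimensions are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class CondTokens(nn.Module):
    """Sketch: project per-modality features into a shared token width and
    concatenate them into one conditioning sequence (hypothetical dims/names)."""

    def __init__(self, dim=1024, garment_dim=768, user_dim=768,
                 motion_dim=512, text_dim=4096):
        super().__init__()
        # Per-modality projections into a common token dimension.
        self.garment_proj = nn.Linear(garment_dim, dim)
        self.user_proj = nn.Linear(user_dim, dim)
        self.motion_proj = nn.Linear(motion_dim, dim)
        self.text_proj = nn.Linear(text_dim, dim)
        # Learned modality-type embeddings so the backbone can tell token types apart.
        self.type_emb = nn.Parameter(torch.zeros(4, dim))

    def forward(self, garment_tokens, user_tokens, motion_tokens, text_tokens):
        # Each input: (batch, n_tokens_for_that_modality, modality_dim).
        parts = [
            self.garment_proj(garment_tokens) + self.type_emb[0],
            self.user_proj(user_tokens) + self.type_emb[1],
            self.motion_proj(motion_tokens) + self.type_emb[2],
            self.text_proj(text_tokens) + self.type_emb[3],
        ]
        # Single homogeneous token sequence for attention in the video diffusion backbone.
        return torch.cat(parts, dim=1)
```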
Supplementary Material: zip
Primary Area: generative models
Submission Number: 5525