Keywords: Multi-modal, Generative Modeling, Interactive Motion, Reactive Motion, Rectified Flow
TL;DR: Unified Multi-Modal Interactive and Reactive 3D Motion Generation via Rectified Flow
Abstract: Generating realistic, context-aware two-person motion conditioned on diverse modalities remains a central challenge in computer graphics, animation, and human-computer interaction. We introduce DualFlow, a unified and efficient framework for multi-modal two-person motion generation. DualFlow conditions 3D motion synthesis on diverse inputs, including text, music, and prior motion sequences. Leveraging rectified flow, it follows deterministic straight-line sampling paths between noise and data, reducing inference time and mitigating the error accumulation common in diffusion-based models. To enhance semantic grounding, DualFlow employs a Retrieval-Augmented Generation (RAG) module that retrieves motion exemplars using music features and LLM-based text decompositions of spatial relations, body movements, and rhythmic patterns. A contrastive objective further strengthens alignment with the conditioning signals, while a synchronization loss improves inter-person coordination. Extensive evaluations across text-to-motion, music-to-motion, and multi-modal interactive benchmarks show consistent gains in motion quality, responsiveness, and efficiency. DualFlow produces temporally coherent and rhythmically synchronized motions, establishing a new state of the art in multi-modal human motion generation.
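The rectified-flow formulation referenced in the abstract can be illustrated with a minimal sketch: the model is trained to regress the constant velocity of the straight line from noise to data, and sampling integrates that velocity field in a few Euler steps. This is not the authors' code; the names `model`, `motion`, and `cond` (the multi-modal conditioning) are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, motion, cond):
    """Train the network to predict the straight-line velocity from noise to data."""
    noise = torch.randn_like(motion)                                # x_0 ~ N(0, I)
    t = torch.rand(motion.shape[0], 1, 1, device=motion.device)     # t ~ U[0, 1]
    x_t = (1.0 - t) * noise + t * motion                            # straight-line interpolant
    target_velocity = motion - noise                                 # constant velocity along the line
    pred_velocity = model(x_t, t.squeeze(), cond)                    # conditioned velocity prediction (assumed signature)
    return F.mse_loss(pred_velocity, target_velocity)

@torch.no_grad()
def sample(model, cond, shape, steps=10, device="cpu"):
    """Few-step Euler integration along the (near-)straight sampling path."""
    x = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt, device=device)
        x = x + dt * model(x, t, cond)
    return x
```

Because the learned trajectories are close to straight lines, a small number of integration steps suffices, which is the source of the inference-time and error-accumulation advantages the abstract claims over diffusion-based samplers.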
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20941