Text2Interact: High-Fidelity and Diverse Text-to-Two-Person Interaction Generation

ICLR 2026 Conference Submission 13206 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Human Motion Generation; Two-person Motion Generation
Abstract: Generating realistic and diverse human-human interactions from text is a crucial yet challenging task in computer vision, graphics, and robotics. Despite recent advances, existing methods have two key limitations. First, two-person interaction synthesis is highly complex, simultaneously requiring individual human motion quality and spatio-temporal synchronization between the interactants. However, due to their limited scale, current datasets cannot effectively support learning such a complex task, which restricts models' ability to generalize. To address this, we propose a scalable data synthesis framework, InterCompose, which leverages the general knowledge encoded in large language models and the motion priors of strong single-person generators to synthesize high-quality two-person interactions that extend beyond the original data distribution. Second, accurately describing the intricacies of two-person motions often requires text of comparable complexity, and modeling such texts with a single sentence-level vector inevitably causes information loss. To model interaction semantics at a finer granularity, we further propose Text2Interact, which features an attention-based word-level conditioning module that improves fine-grained text-motion alignment. Meanwhile, we introduce an adaptive interaction supervision signal that dynamically weights body parts based on the interaction context, enhancing interaction realism. We conduct extensive experiments to validate the effectiveness of our proposed data synthesis and word-level conditioning pipeline. Compared to state-of-the-art models, our approach significantly enhances motion diversity, text-motion alignment, and motion realism. The code and trained models will be released for reproducibility.
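The abstract names two mechanisms: attention-based word-level conditioning and an adaptive, body-part-weighted interaction supervision signal. As a rough illustration only (not the authors' implementation; all names, tensor shapes, and the weighting scheme below are assumptions), word-level conditioning can be realized as cross-attention from motion tokens to per-word text embeddings, and the adaptive supervision as a reconstruction loss weighted per body part:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WordLevelConditioning(nn.Module):
    """Hypothetical sketch: instead of pooling the text encoder output into a
    single sentence vector, keep the full sequence of word embeddings and let
    each motion token attend to the words relevant to it."""

    def __init__(self, motion_dim: int, text_dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=motion_dim, num_heads=n_heads,
            kdim=text_dim, vdim=text_dim, batch_first=True)
        self.norm = nn.LayerNorm(motion_dim)

    def forward(self, motion_tokens, word_embs, word_pad_mask=None):
        # motion_tokens: (B, T, motion_dim); word_embs: (B, L, text_dim)
        # word_pad_mask: (B, L), True at padded word positions
        attended, _ = self.attn(
            query=motion_tokens, key=word_embs, value=word_embs,
            key_padding_mask=word_pad_mask)
        return self.norm(motion_tokens + attended)  # residual update


def adaptive_interaction_loss(pred, target, part_weights):
    """Per-body-part weighted MSE; part_weights is an assumed stand-in for a
    context-dependent signal (e.g. larger where parts are in contact).

    pred/target: (B, T, P, D) joint features grouped into P body parts
    part_weights: (B, T, P) nonnegative weights
    """
    per_part = F.mse_loss(pred, target, reduction="none").mean(dim=-1)
    return (part_weights * per_part).sum() / part_weights.sum().clamp(min=1e-8)
```

The design intent, as described, is that keeping per-word embeddings lets the motion decoder ground individual phrases (e.g. a verb for one interactant) instead of compressing the whole sentence into one vector; the paper itself should be consulted for the actual module and weighting definition.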
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 13206