Abstract: The growing demand for diverse and realistic character animations in video games and films has driven the development of natural language-controlled motion generation systems. While text-driven 3D human motion synthesis has made significant progress, generating realistic multi-person interactions remains a major challenge. Existing methods, such as denoising diffusion models and autoregressive frameworks, have explored interaction dynamics using attention mechanisms and causal modeling. However, they consistently overlook a critical physical constraint: the explicit spatial distance between interacting body parts, which is essential for producing semantically accurate and physically plausible interactions. To address this limitation, we propose InterDist, a novel masked generative Transformer model operating in a discrete state space. Our key idea is to decompose two-person motion into three components: two independent, interaction-agnostic single-person motion sequences and a separate interaction distance sequence. This formulation enables direct learning of both individual motion and dynamic spatial relationships from text prompts. We implement this via a VQ-VAE that jointly encodes independent motions and relative distances into discrete codebooks, followed by a bidirectional masked generative Transformer that models their joint distribution conditioned on text. To better align motion and language, we also introduce a cross-modal interaction module to enhance text-motion association. Our approach ensures that the generated motions exhibit both semantic alignment with the textual descriptions and plausible inter-character distances, setting a new benchmark for text-driven multi-person interaction generation.
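To illustrate the decomposition described above, the following is a minimal sketch (not the authors' code) of splitting a two-person clip into two interaction-agnostic single-person sequences plus a separate interaction distance sequence. The joint layout (22 joints with 3-D positions) and the all-pairs Euclidean distance formulation are illustrative assumptions; the paper does not specify the exact distance representation here.

```python
import numpy as np

def decompose_two_person_motion(motion_a: np.ndarray, motion_b: np.ndarray):
    """Split a two-person clip into independent motions plus a distance sequence.

    motion_a, motion_b: arrays of shape (T, J, 3) holding per-frame 3-D joint
    positions for person A and person B.

    Returns the two single-person sequences unchanged and an interaction
    distance sequence of shape (T, J, J): for every frame, the Euclidean
    distance between each joint of A and each joint of B.
    """
    # (T, J, 1, 3) - (T, 1, J, 3) -> (T, J, J, 3) pairwise offsets per frame
    offsets = motion_a[:, :, None, :] - motion_b[:, None, :, :]
    distances = np.linalg.norm(offsets, axis=-1)  # (T, J, J)
    return motion_a, motion_b, distances

# Toy usage: 60 frames, 22 joints per person (hypothetical sizes).
T, J = 60, 22
person_a = np.random.randn(T, J, 3)
person_b = np.random.randn(T, J, 3) + np.array([1.0, 0.0, 0.0])  # offset person B
_, _, dist_seq = decompose_two_person_motion(person_a, person_b)
print(dist_seq.shape)  # (60, 22, 22)
```

In the full model, each of the three resulting sequences would be tokenized by the VQ-VAE into discrete codebook indices, which the bidirectional masked Transformer then models jointly, conditioned on the text prompt.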
DOI: 10.1109/TVCG.2026.3651382