Keywords: human motion generation, diffusion model, human-human interaction
Abstract: Existing latent diffusion models excel at text-to-motion generation for a single person, but struggle in multi-person scenarios. To address this, we introduce the Interaction Latent Diffusion (ILD) model. Unlike previous methods that use a single-token latent space under geometric constraints, ILD leverages an interaction-aware, multi-token latent space that is enhanced by inter-person constraints and aligned with pretrained tokenizers, strengthening its expressiveness. Building on ILD, we further improve physical plausibility and ensure real-time inference by introducing two key components. First, we propose an efficient neural collision guidance combined with high-order ODE solvers, avoiding costly occupancy-based detection while reducing artifacts and latency. Second, we develop Flash ILD (FILD), a distilled model capable of one-step generation through a tailored consistency distillation and distribution matching pipeline. We evaluate the proposed ILD and FILD qualitatively and quantitatively on the InterHuman and Inter-X datasets. On InterHuman, ILD achieves a new state-of-the-art FID of 4.869 (vs. 5.154 for InterMask), while FILD accelerates inference from 10 FPS to 30 FPS. The code will be made available.
Primary Area: generative models
Supplementary Material: zip
Submission Number: 2441