MoGIC: Boosting Motion Generation via Intention Understanding and Visual Context

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Human Motion Generation, Intention Prediction, Generative Model
Abstract: Existing text-driven motion generation methods primarily focus on bidirectional mapping between language and motion, yet they often struggle to capture the high-level semantic structures and future behavior patterns that govern how actions unfold. Moreover, the absence of visual conditioning limits synthesis accuracy, as language alone cannot specify fine-grained spatiotemporal trajectories or environmental context. We present MoGIC, a unified multimodal framework that jointly models future-aware behavior understanding and multimodal-conditioned motion generation. MoGIC formulates future-behavior prediction as inferring high-level future semantic patterns from partial observations, while leveraging visual priors to resolve ambiguities inherent in text-only conditioning. We further introduce a mixture-of-attention mechanism with adaptive scope that facilitates effective interaction between multimodal tokens and temporal motion segments, thereby mitigating the impact of non-strict timing alignment. To support this paradigm, we curate Mo440H, a 440-hour tri-modal benchmark aggregated from 21 high-quality motion datasets. Extensive experiments demonstrate that MoGIC substantially improves generation fidelity and multimodal versatility: (1) a 36% reduction in FID on HumanML3D and Mo440H; (2) captioning performance superior to LLM-based methods while using only a lightweight text head; (3) new capabilities in future-aware behavior prediction and vision-conditioned motion synthesis. Together, these results advance the state of the art in motion understanding and multi-conditioned generation.
Supplementary Material: pdf
Primary Area: generative models
Submission Number: 18944
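Note: the abstract describes a mixture-of-attention mechanism with adaptive scope but does not specify its design. The snippet below is a minimal, illustrative sketch (not the authors' implementation) of what such a mechanism could look like in PyTorch: several cross-attention "experts", each restricted to a different temporal window around every motion token, mixed by a learned per-token gate. The class name `MixtureOfScopedAttention`, the scope values, the banded masking scheme, and the gating are all assumptions made for illustration.

```python
# Illustrative sketch only: one possible mixture-of-attention layer with adaptive
# temporal scope, NOT the MoGIC paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixtureOfScopedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, scopes=(8, 32, float("inf"))):
        super().__init__()
        # One cross-attention expert per temporal scope (window half-width in frames);
        # an infinite scope means unrestricted (global) attention.
        self.scopes = scopes
        self.experts = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in scopes
        )
        # Gate predicts per-motion-token mixing weights over the experts.
        self.gate = nn.Linear(dim, len(scopes))

    def forward(self, motion_tokens, cond_tokens, cond_times=None):
        # motion_tokens: (B, T, dim) temporal motion segments (queries)
        # cond_tokens:   (B, S, dim) multimodal condition tokens (keys/values)
        # cond_times:    (S,) approximate frame index of each condition token
        B, T, _ = motion_tokens.shape
        S = cond_tokens.size(1)
        device = motion_tokens.device
        if cond_times is None:
            # Assumption: condition tokens are spread roughly uniformly over the clip,
            # reflecting only loose (non-strict) timing alignment with the motion.
            cond_times = torch.linspace(0.0, T - 1.0, S, device=device)
        motion_times = torch.arange(T, dtype=torch.float32, device=device)

        outputs = []
        for scope, attn in zip(self.scopes, self.experts):
            if scope == float("inf"):
                mask = None  # global expert: no temporal restriction
            else:
                # Bool mask of shape (T, S); True blocks attention to condition tokens
                # farther than `scope` frames away. Scopes should exceed the typical
                # spacing of cond_times so no query row is fully masked.
                mask = (motion_times[:, None] - cond_times[None, :]).abs() > scope
            out, _ = attn(motion_tokens, cond_tokens, cond_tokens, attn_mask=mask)
            outputs.append(out)

        mix = F.softmax(self.gate(motion_tokens), dim=-1)     # (B, T, num_experts)
        stacked = torch.stack(outputs, dim=-1)                 # (B, T, dim, num_experts)
        return (stacked * mix.unsqueeze(2)).sum(dim=-1)        # (B, T, dim)


# Example usage with dummy shapes (196 motion frames, 20 condition tokens):
# layer = MixtureOfScopedAttention(dim=256)
# fused = layer(torch.randn(2, 196, 256), torch.randn(2, 20, 256))
```

The intent of the sketch is that narrow-scope experts handle locally aligned cues while the global expert absorbs loosely timed ones, with the gate choosing per token; the actual MoGIC design may differ.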