Keywords: Motion Generation, Human Intention Understanding, Motion Caption
Abstract: We propose MoHI, a motion generation framework that explicitly models human intention as the underlying cause of motion. By disentangling intention prediction from motion synthesis during training and jointly optimizing the two objectives, MoHI captures the motivational logic underlying human actions and provides clearer semantic guidance for coherent motion generation. Experiments on HumanML3D demonstrate state-of-the-art performance, with a 4.5% improvement in R-Precision Top-1 and 38.6% lower FID than the prior state-of-the-art method. Fine-tuned on motion captioning, MoHI also outperforms recent LLM-based approaches, highlighting its unified strength in both motion understanding and generation.
Submission Number: 4