Learning Text-driven 3D Human Motion Generation from 3D-free Web Videos

Published: 19 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0
Keywords: Human Motion Generation
Abstract: Text-driven 3D human motion generation has gained attention for synthesizing complex movements from textual descriptions. Traditional approaches depend on expensive 3D motion capture, which restricts motion diversity, whereas 2D human videos provide abundant and accessible data. However, the absence of large-scale annotated 2D motion datasets and the challenge of generating 3D motion from 2D data remain unresolved. To address this, we introduce MotionWeb, a dataset comprising over 100k motion clips, 17 million frames, and 160 hours of data, with 2D keypoints extracted using state-of-the-art pose estimation models, significantly reducing annotation costs. We further propose Keypoint To Motion (K2M), an efficient framework for text-driven 3D motion generation that leverages 2D supervision without requiring 3D annotations. Experiments show that our method efficiently generates realistic 3D motion with improved quality and diversity using large-scale 2D supervision.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15502