Universal Humanoid Robot Pose Learning from Internet Human Videos

Published: 07 May 2025, Last Modified: 07 May 2025. Venue: ICRA Workshop Human-Centered Robot Learning. License: CC BY 4.0
Workshop Statement: Humanoid robots hold immense potential for human-centered applications, ranging from assistive care to collaborative workspaces. However, the scalability of humanoid learning remains a fundamental challenge due to the difficulty of collecting diverse, high-quality demonstrations. Our work, Humanoid-X and UH-1, directly addresses this limitation by leveraging vast amounts of in-the-wild human videos to develop a scalable, language-conditioned humanoid pose control model. By transforming natural human motion into robot-executable actions, our approach enables humanoid robots to learn from ubiquitous human demonstrations, bridging the gap between human movement and robotic embodiment. The foundation of our method is a universal action representation, learned from a large-scale dataset containing over 20 million humanoid actions paired with natural language descriptions. This approach aligns with human-centered robot learning by allowing intuitive, language-based interaction—enabling users to command humanoid robots using semantic instructions instead of manually engineered motion primitives. Our experiments demonstrate that training on massive human video datasets leads to superior generalization, allowing humanoid robots to perform diverse human-like actions with high fidelity in both simulated and real-world settings. This work represents a scalable and human-centric paradigm for learning adaptive, generalizable humanoid behaviors from human demonstrations.
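
As a rough illustration of the language-conditioned control interface described above, the sketch below shows how a text instruction embedding could be decoded into a horizon of humanoid joint targets. Everything here is an assumption for illustration purposes (the module names, the GRU decoder, the 19-joint action space, and the 64-step horizon); it is not the UH-1 architecture, only a minimal stand-in for the general shape of a text-to-action model.

```python
# Minimal sketch of a language-conditioned pose controller.
# All names and dimensions are hypothetical, not the UH-1 architecture.
import torch
import torch.nn as nn

class TextToPosePolicy(nn.Module):
    def __init__(self, text_dim=512, hidden_dim=256, num_joints=19, horizon=64):
        super().__init__()
        self.horizon = horizon
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_joints)  # per-step joint targets

    def forward(self, text_emb):
        # Broadcast the instruction embedding across the action horizon,
        # then decode a trajectory of joint position targets.
        ctx = self.text_proj(text_emb).unsqueeze(1).repeat(1, self.horizon, 1)
        hidden, _ = self.decoder(ctx)
        return self.head(hidden)  # (batch, horizon, num_joints)

# Usage: embed an instruction with any frozen text encoder, then feed the
# decoded joint targets to the robot's low-level controller.
policy = TextToPosePolicy()
actions = policy(torch.randn(1, 512))  # dummy instruction embedding
print(actions.shape)  # torch.Size([1, 64, 19])
```

The point of such an interface is that commanding the robot reduces to supplying a sentence embedding rather than hand-engineered motion primitives, which is what makes language-based interaction scale.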
Keywords: Humanoid Robot, Learning from Human Demonstrations
TL;DR: We present an approach for large-scale humanoid robot learning by leveraging Internet human videos
Abstract: Scalable learning of humanoid robots is crucial for their deployment in real-world applications. While traditional approaches primarily rely on reinforcement learning or teleoperation to achieve whole-body control, they are often limited by the diversity of simulated environments and the high costs of demonstration collection. In contrast, human videos are ubiquitous and offer an untapped source of semantic and motion information that could significantly enhance the generalization capabilities of humanoid robots. This paper introduces Humanoid-X, a large-scale dataset of over 20 million humanoid robot poses with corresponding text-based motion descriptions, designed to leverage this abundant data. Humanoid-X is curated through a comprehensive pipeline: data mining from the Internet, video caption generation, motion retargeting from humans to humanoid robots, and policy learning for real-world deployment. With Humanoid-X, we further train a large humanoid model, UH-1, which takes text instructions as input and outputs corresponding actions to control a humanoid robot. Extensive simulated and real-world experiments validate that our scalable training approach leads to superior generalization in text-based humanoid control, marking a significant step toward adaptable, real-world-ready humanoid robots.
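
To make the retargeting stage of the pipeline concrete, here is a deliberately simplified sketch of its core idea: copying matched human joint angles onto the humanoid and clamping them to the robot's joint limits. The joint correspondence, the limits, and the 24-joint human skeleton below are all placeholders; real retargeting must additionally account for differing link lengths, foot contacts, and dynamic feasibility.

```python
# Hedged sketch of naive motion retargeting: map human joint angles onto a
# humanoid through a fixed correspondence, clamped to joint limits.
# The correspondence and limits below are placeholders, not real robot data.
import numpy as np

HUMAN_TO_ROBOT = {16: 0, 17: 1, 18: 2, 19: 3}  # hypothetical joint mapping
ROBOT_LIMITS = np.array([[-2.0, 2.0]] * 4)     # placeholder limits (radians)

def retarget_frame(human_angles: np.ndarray) -> np.ndarray:
    """Map one frame of human joint angles to robot joint targets."""
    robot = np.zeros(len(HUMAN_TO_ROBOT))
    for h_idx, r_idx in HUMAN_TO_ROBOT.items():
        robot[r_idx] = np.clip(human_angles[h_idx],
                               ROBOT_LIMITS[r_idx, 0], ROBOT_LIMITS[r_idx, 1])
    return robot

frame = np.random.uniform(-3.0, 3.0, size=24)  # dummy 24-joint human pose
print(retarget_frame(frame))
```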
Submission Number: 3