Keywords: Robot Manipulation, Humanoid, Foundation Model, Learn from Human
Abstract: Egocentric videos are a valuable and scalable data source for learning manipulation policies. However, due to significant data heterogeneity, most existing approaches use human data only for simple pre-training, which does not unlock its full potential. This paper provides a recipe for collecting and using egocentric data by dividing human data into two categories: in-the-wild and on-task. We first curate a dataset, PHSD, which contains over 1,000 hours of diverse in-the-wild egocentric data and over 20 hours of on-task data directly aligned with the target manipulation tasks. This enables learning a large egocentric, language-conditioned flow matching policy, Human0. We further adopt domain adaptation techniques to close the gap between humans and humanoids. Empirically, we show that Human0 exhibits several novel properties, including language instruction following from only human data, few-shot learning, and improved robustness using on-task data. For full reproducibility, we plan to release the dataset, base weights, and code upon acceptance.
Primary Area: applications to robotics, autonomy, planning
Submission Number: 634