Humanoid Bimanual Dexterous Manipulation Driven by Egocentric Video

Published: 31 May 2026, Last Modified: 04 Jun 2026
Venue: Beyond Teleop workshop, ICRA 2026 (Oral)
License: CC BY 4.0
Keywords: Dexterous manipulation, MANO hand, Scene reconstruction, World model
TL;DR: A humanoid robot learns bimanual dexterous manipulation largely from unlabeled egocentric videos, needing only a small amount of teleoperation data to adapt to real-world tasks.
Abstract: Bimanual dexterous manipulation lies at the core of human-level interaction with the physical world, enabling coordinated, contact-rich behaviors that single-arm systems cannot replicate. However, learning such skills for humanoid robots remains challenging: teleoperation, the dominant data-collection paradigm, requires expert operators and real-time execution, which limits the scale, diversity, and naturalness of available demonstrations. We introduce a label-free video-to-policy framework that reduces the need for teleoperation demonstrations by pretraining bimanual manipulation skills on large-scale, in-the-wild egocentric videos and then fine-tuning on a handful of teleoperation demonstrations. Our approach has two key components: (1) an automatic wrist-finger pose extraction pipeline that reconstructs metrically consistent 3D wrist and fingertip trajectories from unconstrained egocentric videos by jointly recovering scene geometry and MANO-based hand kinematics; and (2) a temporal-scale world model that unifies visual dynamics prediction with action generation through temporal-scale autoregressive forecasting and sparse attention regularization. Real-world experiments on a Unitree H1-2 humanoid with dexterous hands show that the resulting policy generalizes across objects, backgrounds, and tasks.
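As a rough illustration of what "metrically consistent 3D wrist and fingertip trajectories" can mean in practice, the sketch below lifts per-frame hand keypoints from the moving egocentric camera frame into a single fixed world frame using per-frame camera poses. It is not the authors' pipeline. The function name `hand_to_world`, the 21-keypoint layout (wrist at index 0, fingertips at 4, 8, 12, 16, 20), and the assumption that camera-to-world poses are already available from scene reconstruction are all illustrative assumptions.

```python
import numpy as np

# Assumed keypoint layout (hypothetical, not taken from the paper):
# 21 keypoints per hand, wrist at index 0, fingertips at 4, 8, 12, 16, 20.
WRIST = 0
FINGERTIPS = [4, 8, 12, 16, 20]


def hand_to_world(keypoints_cam, cam_to_world):
    """Express camera-frame hand keypoints in a fixed world frame.

    keypoints_cam: (T, 21, 3) keypoints per frame, in the camera frame (metres).
    cam_to_world:  (T, 4, 4) homogeneous camera-to-world poses, assumed to come
                   from a separate scene-reconstruction step.
    Returns (wrist, fingertips) with shapes (T, 3) and (T, 5, 3).
    """
    R = cam_to_world[:, :3, :3]  # per-frame rotation
    t = cam_to_world[:, :3, 3]   # per-frame translation
    # world[t, k] = R[t] @ keypoints_cam[t, k] + t[t]
    world = np.einsum("tij,tkj->tki", R, keypoints_cam) + t[:, None, :]
    return world[:, WRIST], world[:, FINGERTIPS, :]
```

Expressing every frame in one world frame removes the camera's own motion from the hand trajectory, which matters for head-mounted footage where the camera moves with the wearer.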
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 5