DexMan: Learning Bimanual Dexterous Manipulation from Human and Generated Videos

13 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Bimanual dexterous manipulation, visual demonstration learning, reinforcement learning
TL;DR: We present a pipeline that learns bimanual dexterous manipulation from a single uncalibrated human video, employing a contact-prior reward to robustly handle noise and learn from implausible hand–object references.
Abstract: We present DexMan, an automated framework that converts human visual demonstrations into bimanual dexterous manipulation skills for humanoid robots in simulation. Operating directly on third-person videos of humans manipulating rigid objects, DexMan eliminates the need for camera calibration, depth sensors, scanned 3D object assets, or ground-truth hand and object motion annotations. Unlike prior approaches that consider only simplified floating hands, it directly controls a humanoid robot and leverages novel contact-based rewards to improve policy learning from noisy hand–object poses estimated from in-the-wild videos. DexMan achieves state-of-the-art performance in object pose estimation on the TACO benchmark, with absolute gains of 0.08 and 0.12 in ADD-S and VSD, respectively, and its reinforcement learning policy surpasses previous methods by 19% in success rate on OakInk-v2. Furthermore, DexMan can generate skills from both real and synthetic videos without manual data collection or costly motion capture, enabling the creation of large-scale, diverse datasets for training generalist dexterous manipulation policies. Video results are available at: https://dexman2026.github.io/
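For intuition, here is a minimal sketch of what a contact-based reward of this kind might look like: fingertips are rewarded for proximity to hand–object contact points estimated from video, with a smooth kernel that degrades gracefully when the estimates are noisy or implausible. The function name, distance kernel, and `sigma` parameter are illustrative assumptions, not DexMan's actual formulation.

```python
import numpy as np

def contact_prior_reward(fingertip_pos, contact_points, sigma=0.02):
    """Dense reward for fingertip proximity to estimated contact points.

    fingertip_pos: (F, 3) robot fingertip positions in the world frame.
    contact_points: (C, 3) hand-object contact locations estimated from video.
    sigma: length scale in meters; larger values tolerate noisier estimates.
    """
    # Distance from every fingertip to every estimated contact point.
    dists = np.linalg.norm(
        fingertip_pos[:, None, :] - contact_points[None, :, :], axis=-1
    )
    # Score each fingertip against its nearest contact point, so the signal
    # stays smooth even when individual estimates are implausible.
    nearest = dists.min(axis=1)
    return float(np.exp(-nearest / sigma).mean())

# Toy usage: two fingertips, one already at the single estimated contact.
reward = contact_prior_reward(
    np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]]),
    np.array([[0.0, 0.0, 0.0]]),
)
print(f"contact reward: {reward:.3f}")  # ~0.503
```

A dense, distance-based term like this gives the policy a useful gradient even when exact pose tracking of the noisy reference would be brittle, which is the general motivation behind contact-based rewards.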
Primary Area: reinforcement learning
Submission Number: 4832