Keywords: Video Diffusion, Hand-Object Interaction
TL;DR: We propose a novel sim2real HOI video transfer pipeline that generates realistic real-world videos using only hand and object poses and an object mesh, overcoming limitations of current methods and setting a new benchmark for HOI video generation.
Abstract: We present Sim2Real-HOI, a zero-shot framework that closes the sim-to-real gap for hand–object interaction (HOI) video generation using only the initial and target poses of the hand and object. Controllable diffusion models such as InterDyn and ManiVideo stumble when moved from simulation to reality: the quality of the generated videos is suboptimal, and they rely on cues unobtainable from a simulator, such as a real first frame. Sim2Real-HOI addresses the problem in two stages: (1) an appearance generator that models both appearance and background with a controllable image diffusion model, and (2) a motion transfer model that transfers motion, produced by a pretrained hand pose generator, to real-world video through a controllable video diffusion model. To improve fidelity, we incorporate multiple types of conditions that keep the generated output aligned with the geometry, semantics, and fine details of the hand pose. Extensive experiments on DexYCB and OakInk2 demonstrate that Sim2Real-HOI improves generation quality over the best prior work and yields a lower error rate when the generated videos are used to train downstream hand-pose estimators. The code and pretrained weights will be made publicly available.
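To make the two-stage design concrete, here is a minimal Python sketch of the pipeline as described in the abstract. All names in it (SimInputs, AppearanceGenerator, MotionTransferModel, sim2real_hoi) and all tensor shapes are hypothetical placeholders chosen for illustration, not the authors' implementation or API.

```python
"""Hypothetical sketch of the two-stage Sim2Real-HOI pipeline.

Stage 1 generates appearance and background from the initial poses with a
controllable image diffusion model; Stage 2 propagates that appearance along
the simulated motion with a controllable video diffusion model. Every class,
field, and shape below is an assumption, not the authors' code.
"""

from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class SimInputs:
    """Simulator-side inputs only: poses and geometry, no real-world frames."""
    hand_poses: np.ndarray    # (T, 51) MANO-style hand pose trajectory (assumed)
    object_poses: np.ndarray  # (T, 7) object translation + quaternion (assumed)
    object_mesh: str          # path to the object mesh (assumed)


class AppearanceGenerator:
    """Stage 1: controllable image diffusion that synthesizes appearance and
    background for the first frame, conditioned on the initial poses."""

    def generate_first_frame(self, sim: SimInputs) -> np.ndarray:
        # Placeholder: a real model would denoise from noise under geometry
        # and semantic conditions rendered from sim.hand_poses[0],
        # sim.object_poses[0], and sim.object_mesh.
        return np.zeros((512, 512, 3), dtype=np.uint8)


class MotionTransferModel:
    """Stage 2: controllable video diffusion that transfers the simulated
    motion onto the generated appearance."""

    def transfer(self, first_frame: np.ndarray, sim: SimInputs) -> List[np.ndarray]:
        # Placeholder: a real model would condition each frame on per-frame
        # geometry renders, semantics, and fine hand-pose detail.
        return [first_frame.copy() for _ in range(len(sim.hand_poses))]


def sim2real_hoi(sim: SimInputs) -> List[np.ndarray]:
    """End-to-end: generate appearance first, then transfer motion."""
    frame0 = AppearanceGenerator().generate_first_frame(sim)
    return MotionTransferModel().transfer(frame0, sim)


if __name__ == "__main__":
    demo = SimInputs(
        hand_poses=np.zeros((16, 51)),
        object_poses=np.zeros((16, 7)),
        object_mesh="obj.ply",
    )
    video = sim2real_hoi(demo)
    print(f"generated {len(video)} frames of shape {video[0].shape}")
```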
Supplementary Material: zip
Primary Area: generative models
Submission Number: 8378