Keywords: Video Diffusion, Hand-Object Interaction
TL;DR: We propose a novel sim2real HOI video transfer pipeline that generates realistic real-world videos using only hand and object poses and an object mesh, overcoming limitations of current methods and setting a new benchmark for HOI video generation.
Abstract: We present Sim2Real-HOI, a zero-shot framework that closes the sim-to-real gap for hand–object interaction (HOI) video generation using only the initial and target poses of the hand and object. Controllable diffusion models such as InterDyn and ManiVideo stumble when moved from simulation to reality: the quality of the generated videos is suboptimal, and they rely on cues unobtainable from a simulator, such as a real first frame. Sim2Real-HOI addresses the problem in two stages: (1) an appearance generator that models both appearance and background with a controllable image diffusion model, and (2) a motion transfer model that transfers motion, produced by a pretrained hand pose generator, to real-world video through a controllable video diffusion model. To improve fidelity, we incorporate multiple types of conditions that keep the generated output aligned with the geometry, semantics, and fine details of the hand pose. Extensive experiments on DexYCB and OakInk2 demonstrate that Sim2Real-HOI improves generation quality over the best prior work and yields a lower error rate when the generated videos are used to train downstream hand-pose estimators. The code and pretrained weights will be made publicly available.
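To make the two-stage design concrete, here is a minimal Python sketch of the pipeline as described in the abstract. All names in it (SimInputs, AppearanceGenerator, MotionTransferModel, sim2real_hoi) and all tensor shapes are hypothetical placeholders chosen for illustration, not the authors' implementation or API.

```python
"""Hypothetical sketch of the two-stage Sim2Real-HOI pipeline.

Stage 1 generates appearance and background from the initial poses with a
controllable image diffusion model; Stage 2 propagates that appearance along
the simulated motion with a controllable video diffusion model. Every class,
field, and shape below is an assumption, not the authors' code.
"""

from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class SimInputs:
    """Simulator-side inputs only: poses and geometry, no real-world frames."""
    hand_poses: np.ndarray    # (T, 51) MANO-style hand pose trajectory (assumed)
    object_poses: np.ndarray  # (T, 7) object translation + quaternion (assumed)
    object_mesh: str          # path to the object mesh (assumed)


class AppearanceGenerator:
    """Stage 1: controllable image diffusion that synthesizes appearance and
    background for the first frame, conditioned on the initial poses."""

    def generate_first_frame(self, sim: SimInputs) -> np.ndarray:
        # Placeholder: a real model would denoise from noise under geometry
        # and semantic conditions rendered from sim.hand_poses[0],
        # sim.object_poses[0], and sim.object_mesh.
        return np.zeros((512, 512, 3), dtype=np.uint8)


class MotionTransferModel:
    """Stage 2: controllable video diffusion that transfers the simulated
    motion onto the generated appearance."""

    def transfer(self, first_frame: np.ndarray, sim: SimInputs) -> List[np.ndarray]:
        # Placeholder: a real model would condition each frame on per-frame
        # geometry renders, semantics, and fine hand-pose detail.
        return [first_frame.copy() for _ in range(len(sim.hand_poses))]


def sim2real_hoi(sim: SimInputs) -> List[np.ndarray]:
    """End-to-end: generate appearance first, then transfer motion."""
    frame0 = AppearanceGenerator().generate_first_frame(sim)
    return MotionTransferModel().transfer(frame0, sim)


if __name__ == "__main__":
    demo = SimInputs(
        hand_poses=np.zeros((16, 51)),
        object_poses=np.zeros((16, 7)),
        object_mesh="obj.ply",
    )
    video = sim2real_hoi(demo)
    print(f"generated {len(video)} frames of shape {video[0].shape}")
```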
Supplementary Material: zip
Primary Area: generative models
Submission Number: 8378