HOLD: Category-Agnostic 3D Reconstruction of Interacting Hands and Objects from Video

Published: 01 Jan 2024, Last Modified: 13 Nov 2024 · CVPR 2024 · CC BY-SA 4.0
Abstract: Since humans interact with diverse objects every day, the holistic 3D capture of these interactions is important for understanding and modelling human behaviour. However, most existing methods for hand-object reconstruction from RGB either assume pre-scanned object templates or rely heavily on limited 3D hand-object data, restricting their ability to scale and generalize to more unconstrained interaction settings. To address this, we introduce HOLD, the first category-agnostic method that jointly reconstructs an articulated hand and an object from a monocular interaction video. We develop a compositional articulated implicit model that reconstructs disentangled 3D hands and objects from 2D images. We further incorporate hand-object constraints to improve hand-object poses and consequently the reconstruction quality. Our method does not rely on any 3D hand-object annotations, yet significantly outperforms fully-supervised baselines in both in-the-lab and challenging in-the-wild settings. Moreover, we qualitatively demonstrate its robustness when reconstructing from in-the-wild videos. See here for code, data, models, and updates.
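To make the two key ideas in the abstract concrete, here is a minimal sketch (not the authors' implementation) of (a) composing two disentangled implicit fields, an articulated hand SDF and an object SDF, into one scene, and (b) a simple interpenetration penalty as one possible form of hand-object constraint. All names and the placeholder sphere fields are illustrative assumptions; in HOLD the fields would be learned, pose-conditioned networks.

```python
# Minimal sketch, assuming PyTorch; hand_sdf/obj_sdf are stand-ins for
# learned implicit fields, not the paper's actual networks.
import torch

def hand_sdf(x: torch.Tensor) -> torch.Tensor:
    # Placeholder articulated hand field: a unit-ish sphere standing in
    # for a pose-conditioned MLP. Returns a signed distance per point.
    return x.norm(dim=-1) - 0.5

def obj_sdf(x: torch.Tensor) -> torch.Tensor:
    # Placeholder object field: a shifted sphere standing in for a
    # category-agnostic object MLP.
    return (x - torch.tensor([0.6, 0.0, 0.0])).norm(dim=-1) - 0.4

def scene_sdf(x: torch.Tensor) -> torch.Tensor:
    # Compositional scene: the union (pointwise min) of the two
    # disentangled fields, so each part can still be queried and
    # reconstructed on its own.
    return torch.minimum(hand_sdf(x), obj_sdf(x))

def penetration_loss(hand_surface_pts: torch.Tensor) -> torch.Tensor:
    # One possible hand-object constraint: penalize hand surface points
    # that fall inside the object (negative object SDF).
    return torch.relu(-obj_sdf(hand_surface_pts)).mean()

# Toy usage: query the composed field and the constraint on random points.
pts = torch.randn(1024, 3)
print(scene_sdf(pts).shape)           # torch.Size([1024])
print(penetration_loss(pts).item())   # non-negative scalar
```

The union-of-SDFs composition is a common way to render a joint scene while keeping per-entity geometry separable; the penetration term is only one example of the kind of hand-object constraint the abstract refers to.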