CARI4D: Category-Agnostic 4D Reconstruction of Human-Object Interaction

Xianghui Xie; Bowen Wen; Stan Birchfield

Published: 27 May 2026, Last Modified: 27 May 2026, H2R, CC BY 4.0

Keywords: 3D reconstruction; Human-object interaction; robotics

Abstract: Reconstructing 4D human-object interaction from monocular RGB video would make large-scale interaction capture possible outside controlled studios, but the task is ill posed: object geometry is unknown, depth and scale are ambiguous, and contacts are often heavily occluded. We present CARI4D, a category-agnostic framework that reconstructs a metric-scale object, estimates temporally consistent human and object motion, and reasons about hand-object contact from a single RGB video. CARI4D combines foundation models for object generation, metric depth, human pose, and object pose, but explicitly aligns their outputs through coarse-to-fine scale selection, pose-hypothesis filtering, learned render-and-compare contact reasoning, and contact-aware joint optimization. On BEHAVE and zero-shot InterCap evaluations, CARI4D improves combined reconstruction Chamfer distance by more than 35% over prior video-based baselines, while also generalizing to in-the-wild videos with previously unseen object categories.

Submission Number: 7
