Keywords: 3D reconstruction; Human-object interaction; robotics
Abstract: Reconstructing 4D human-object interaction from monocular RGB video would make
large-scale interaction capture possible outside controlled studios, but the task is
ill posed: object geometry is unknown, depth and scale are ambiguous, and contacts
are often heavily occluded.
We present CARI4D, a category-agnostic framework that reconstructs a metric-scale
object, estimates temporally consistent human and object motion, and reasons about
hand-object contact from a single RGB video.
CARI4D combines foundation models for object generation, metric depth, human pose,
and object pose, but explicitly aligns their outputs through coarse-to-fine scale
selection, pose-hypothesis filtering, learned render-and-compare contact reasoning,
and contact-aware joint optimization.
On BEHAVE and zero-shot InterCap evaluations, CARI4D improves combined
reconstruction Chamfer distance by more than 35% over prior video-based baselines,
while also generalizing to in-the-wild videos with previously unseen object categories.
Submission Number: 7