Abstract: Capturing the interactions between humans and their environment in 3D is important for many applications in robotics, graphics, and vision. Recent works that reconstruct the 3D human and object from a single RGB image do not produce consistent relative translation across frames because they assume a fixed depth. Moreover, their performance drops significantly when the object is occluded. In this work, we propose a novel method to track the 3D human, object, contacts, and relative translation across frames from a single RGB camera, while being robust to heavy occlusions. Our method is built on two key insights. First, we condition our neural field reconstructions for the human and object on per-frame SMPL model estimates obtained by pre-fitting SMPL to a video sequence. This improves neural reconstruction accuracy and produces coherent relative translation across frames. Second, human and object motion from visible frames provides valuable information for inferring the occluded object. We propose a novel transformer-based neural network that explicitly uses object visibility and human motion to leverage neighboring frames and make predictions for the occluded frames. Building on these insights, our method is able to track both the human and the object robustly even under occlusions. Experiments on two datasets show that our method significantly improves over state-of-the-art methods. Our code and pretrained models are available at: https://virtualhumans.mpiinf.mpg.de/VisTracker.
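The second insight can be illustrated with a minimal sketch: a temporal transformer that attends over per-frame tokens built from human motion features and (possibly occluded) object estimates, with an explicit visibility flag. This is not the authors' implementation; the module names, feature dimensions, and pose parameterization below are assumptions for illustration only.

```python
# Minimal, illustrative sketch (not the authors' implementation) of a
# visibility-aware transformer that infers object pose in occluded frames
# from neighboring visible frames and human motion. All dimensions,
# names, and the pose parameterization are hypothetical.
import torch
import torch.nn as nn


class VisibilityAwareTracker(nn.Module):
    def __init__(self, human_dim=72, obj_dim=9, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        # Embed per-frame human motion (e.g. SMPL pose) and per-frame object pose.
        self.human_embed = nn.Linear(human_dim, d_model)
        self.obj_embed = nn.Linear(obj_dim, d_model)
        # Learned embedding that flags whether the object is visible in a frame.
        self.vis_embed = nn.Embedding(2, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Predict object pose (e.g. rotation + translation) for every frame.
        self.head = nn.Linear(d_model, obj_dim)

    def forward(self, human_pose, obj_pose, visible):
        # human_pose: (B, T, human_dim) per-frame human motion features
        # obj_pose:   (B, T, obj_dim)   per-frame object estimates (noisy when occluded)
        # visible:    (B, T)            1 if the object is visible, 0 if occluded
        vis = visible.long()
        # Suppress unreliable object features in occluded frames; keep human motion.
        obj_feat = self.obj_embed(obj_pose) * visible.unsqueeze(-1)
        tokens = self.human_embed(human_pose) + obj_feat + self.vis_embed(vis)
        # Temporal attention lets occluded frames borrow evidence from visible neighbors.
        fused = self.encoder(tokens)
        return self.head(fused)


# Usage: refine object pose over a 64-frame window with random occlusions.
B, T = 2, 64
model = VisibilityAwareTracker()
pred = model(torch.randn(B, T, 72), torch.randn(B, T, 9),
             (torch.rand(B, T) > 0.3).float())
print(pred.shape)  # torch.Size([2, 64, 9])
```

The design choice sketched here mirrors the abstract's description: occluded frames contribute little direct object evidence, so their predictions are driven by the visibility flag, the human motion signal, and attention to visible neighboring frames.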