Abstract: Reconstructing 3D hand meshes from video is
challenging because manipulated objects often occlude the
hand. These occlusions degrade the information extracted
from the obscured regions and weaken the temporal coherence
of the hand across frames.
Existing approaches focus primarily on global occlusion
regions but overlook temporal hand coherence, which limits
their performance. We propose a novel framework called
IFVONet, designed to improve 3D hand mesh reconstruction
by capturing inter-frame variations and improving the
recovery of globally occluded features. IFVONet
comprises three key components: (1) a Pixel-Domain Variation
Module that identifies inter-frame variations across adjacent
frames, enhancing temporal hand coherence; (2) an Enhanced
Global Occlusion Recovery Module that integrates hand
information into the global occlusion representation, thereby
improving the accuracy of occlusion feature recovery; and
(3) a Hand Regression Module that dynamically aggregates hand
information from inter-frame variations and globally recovered
occlusion features into comprehensive hand representations,
leading to improved 3D hand reconstruction. Extensive experiments on
the HO3D-v2 and HO3D-v3 datasets demonstrate that our
proposed IFVONet achieves state-of-the-art performance on
both 3D hand mesh reconstruction and pose estimation.
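
The abstract does not specify how the Pixel-Domain Variation Module computes inter-frame variations. As a rough illustration of the general idea only, the sketch below uses plain absolute frame differencing between adjacent grayscale frames. The function name `inter_frame_variation` and the list-of-lists frame representation are assumptions made for this example; this is not IFVONet's module.

```python
from typing import List

# Hypothetical representation: a grayscale frame as rows of pixel intensities.
Frame = List[List[float]]


def inter_frame_variation(frames: List[Frame]) -> List[Frame]:
    """Return the absolute per-pixel difference for each pair of adjacent frames.

    This is simple frame differencing, used here as a stand-in for
    "pixel-domain inter-frame variation"; it is an assumption, not the
    method described in the paper.
    """
    if len(frames) < 2:
        # Fewer than two frames means there is no adjacent pair to compare.
        return []
    variations: List[Frame] = []
    for prev, curr in zip(frames, frames[1:]):
        variations.append([
            [abs(c - p) for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)
        ])
    return variations
```

For a clip of T frames this yields T-1 variation maps, where nonzero entries mark pixels that changed between neighbouring frames, for example where the hand or the manipulated object moved.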