Abstract: Estimating the 3D poses of interacting hands from a monocular image is challenging due to the similarity in appearance between hand parts. Therefore, relying on appearance features alone tends to result in unreliable pose estimation. Existing approaches directly fuse the appearance features with position features, ignoring that the two types of features are heterogeneous. Here, the appearance features are derived from the RGB values of pixels, while the position features are mapped from the coordinates of pixels or joints. To address this problem, we present a novel framework called \textbf{D}ecoupled \textbf{F}eature \textbf{L}earning (\textbf{DFL}) for 3D pose estimation of interacting hands. By decoupling the appearance and position features, we facilitate the interactions within each feature type and those between the two types of features. First, we compute the appearance relationships between the joint queries and the image feature maps; we utilize these relationships to aggregate each joint's appearance and position features. Second, we compute the 3D spatial relationships between hand joints using their position features; we utilize these relationships to guide the feature enhancement of joints. Third, we calculate appearance relationships and spatial relationships between the joints and the image using the appearance and position features, respectively; we utilize these complementary relationships to facilitate the localization of the joints in the image. The two processes mentioned above are conducted iteratively. Finally, only the refined position features are used for hand pose estimation. This strategy avoids the step of mapping heterogeneous appearance features to hand-joint positions. Our method significantly outperforms state-of-the-art methods on the large-scale InterHand2.6M dataset. More impressively, our method exhibits strong generalization ability on in-the-wild images. The code will be released.
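To make the decoupling idea concrete, the following is a minimal PyTorch-style sketch of one refinement iteration, written from the abstract alone and not from the authors' code. All module names, tensor shapes, and design details (e.g., sharing the joint-to-image attention weights to aggregate both feature types, and decoding 3D joints only from the position features) are illustrative assumptions; the third step of the abstract (combining appearance and spatial relationships for joint localization) is simplified away here.

```python
# Illustrative sketch only -- NOT the authors' implementation.
# Assumed shapes: joint features (B, J, C), image tokens (B, N, C).
import torch
import torch.nn as nn


class DecoupledRefinement(nn.Module):
    """One hypothetical iteration of decoupled appearance/position feature learning."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # Cross-attention: appearance relationships between joint queries and image tokens.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Self-attention over joints: spatial relationships derived from position features.
        self.joint_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Only the refined position features are decoded into 3D joint coordinates.
        self.pos_head = nn.Linear(dim, 3)

    def forward(self, joint_app, joint_pos, img_app, img_pos):
        # (1) Appearance relationships between joint queries and image feature maps;
        #     the same attention weights aggregate appearance AND position features,
        #     so the two feature types stay decoupled but share the relationship.
        app_out, attn_w = self.cross_attn(joint_app, img_app, img_app)
        joint_app = joint_app + app_out
        joint_pos = joint_pos + torch.bmm(attn_w, img_pos)

        # (2) 3D spatial relationships between joints, computed from position features,
        #     guide the enhancement of the joints' appearance features.
        spatial_out, _ = self.joint_attn(joint_pos, joint_pos, joint_app)
        joint_app = joint_app + spatial_out

        # (3) Decode 3D joint positions from the refined position features only.
        return joint_app, joint_pos, self.pos_head(joint_pos)
```

In this sketch, keeping separate residual streams for appearance and position features (rather than concatenating them) is what avoids mapping heterogeneous appearance features directly to hand-joint positions; in the actual method this block would be stacked and applied iteratively.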
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: The interacting hand pose estimation task significantly contributes to multimedia and multimodal processing. It plays a vital role in human-computer interaction systems, enhancing user experience and facilitating tasks like gesture recognition, sign language interpretation, and virtual reality interactions.
Integrating hand pose estimation with other modalities such as speech, facial expressions, and body movements improves the robustness and accuracy of multimodal systems. For example, combining hand pose estimation with facial and speech analysis in video conferencing enhances communication and collaboration.
Understanding hand poses in multimedia analysis provides insights into human actions and intentions, enabling applications like action recognition, object manipulation detection, and hand gesture-based control systems. Incorporating hand pose estimation into multimodal processing frameworks enables a richer understanding of human behavior.
In conclusion, the interacting hand pose estimation task contributes to multimedia and multimodal processing by enhancing user interaction, improving system performance, and enabling a deeper understanding of human actions and intentions. Its integration with other modalities opens up new possibilities for advanced multimedia applications and human-centric technologies.
Supplementary Material: zip
Submission Number: 2377