Learning Context with Priors for 3D Interacting Hand-Object Pose Estimation

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 | MM2024 Poster | CC BY 4.0
Abstract: Achieving 3D hand-object pose estimation in interaction scenarios is challenging due to the severe occlusion generated during the interaction. Existing methods address this issue by exploiting the correlation between the hand and object poses as an additional cue. They typically first extract hand and object features from their respective regions and then refine each with the other. However, this paradigm disregards the role of the broader image context. To address this problem, we propose a novel and robust approach that learns a broad range of context by imposing priors. First, we build the approach from stacked transformer decoder layers. These layers are dedicated to extracting image-wide context and regional hand or object features, respectively, by constraining their cross-attention operations. We share the context decoder layer parameters between the hand and object pose estimation branches to avoid interference in the context-learning process. This imposes a prior that the hand and object are mutually the most important context for each other, which significantly enhances the robustness of the learned context features. Second, since the context, hand, and object decoder layers play different roles, we provide each of them with a customized feature map. This strategy disentangles the layers and reduces the complexity of feature learning. Finally, we conduct extensive experiments on the popular HO3D and Dex-YCB databases. The experimental results indicate that our method significantly outperforms state-of-the-art approaches and can be applied to other hand pose estimation tasks. The code will be released.
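To make the parameter-sharing prior concrete, below is a minimal PyTorch-style sketch of the idea described in the abstract: a single context decoder whose weights are shared by the hand and object branches, plus branch-specific decoders that attend to customized regional feature maps. All module and tensor names (SharedContextDecoders, hand_q, context_feat, etc.) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class SharedContextDecoders(nn.Module):
    """Illustrative sketch (not the official implementation):
    one context decoder shared between the hand and object branches,
    and separate decoders for regional hand / object features."""

    def __init__(self, dim=256, heads=8, num_layers=2):
        super().__init__()
        make_layer = lambda: nn.TransformerDecoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        # A single set of context-decoder weights, reused by both branches:
        # this encodes the prior that hand and object are each other's
        # most important context.
        self.context_decoder = nn.TransformerDecoder(make_layer(), num_layers)
        # Branch-specific decoders for regional hand and object features.
        self.hand_decoder = nn.TransformerDecoder(make_layer(), num_layers)
        self.obj_decoder = nn.TransformerDecoder(make_layer(), num_layers)

    def forward(self, hand_q, obj_q, context_feat, hand_feat, obj_feat):
        # Both query sets cross-attend to the image-wide feature map through
        # the SAME context decoder (shared parameters).
        hand_ctx = self.context_decoder(hand_q, context_feat)
        obj_ctx = self.context_decoder(obj_q, context_feat)
        # Each branch then cross-attends to its own customized feature map,
        # approximating the constrained cross-attention described above.
        hand_out = self.hand_decoder(hand_ctx, hand_feat)
        obj_out = self.obj_decoder(obj_ctx, obj_feat)
        return hand_out, obj_out

# Example usage with hypothetical shapes (batch=2, 21 hand queries, 8 object queries).
model = SharedContextDecoders()
hand_out, obj_out = model(
    torch.randn(2, 21, 256), torch.randn(2, 8, 256),
    torch.randn(2, 196, 256), torch.randn(2, 49, 256), torch.randn(2, 49, 256))
```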
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: 3D hand-object pose estimation plays a pivotal role in multimedia and multimodal processing by enhancing the understanding of human interactions with digital content. By accurately estimating the poses of hands and objects in three dimensions, it enables more immersive virtual reality experiences, realistic computer graphics, and augmented reality applications. In multimedia content creation, such as animation and gaming, it facilitates the generation of lifelike movements and interactions between virtual characters and objects. Moreover, in multimodal interfaces, such as gesture-based controls and human-computer interaction systems, precise 3D pose estimation enables natural and intuitive interactions with devices and software. This technology can also aid in surveillance and security systems by analyzing human actions and interactions with objects in real-time. Overall, 3D hand-object pose estimation significantly contributes to advancing the capabilities of multimedia content creation, interactive systems, and multimodal interfaces, ultimately enhancing user experiences across various digital platforms.
Supplementary Material: zip
Submission Number: 2368