Geometry-Guided Diffusion Model with Masked Transformer for Robust Multi-View 3D Human Pose Estimation
Abstract: Recent research on Diffusion Models and Transformers has brought significant advancements to 3D Human Pose Estimation (HPE). Nonetheless, existing methods often fail to address accuracy and generalization at the same time. In this paper, we propose a **G**eometry-guided D**if**fusion Model with Masked Trans**former** (Masked Gifformer) for robust multi-view 3D HPE. Within the diffusion framework, a hierarchical multi-view transformer-based denoiser fits the 3D pose distribution by systematically integrating joint and view information. To address the long-standing problem of poor generalization, we introduce a fully random mask mechanism that requires no additional learnable modules or parameters. Furthermore, we incorporate geometric guidance into the diffusion model to improve accuracy: the sampling process is optimized to minimize reprojection errors by modeling a conditional guidance distribution. Extensive experiments on two benchmarks demonstrate that Masked Gifformer achieves an effective trade-off between accuracy and generalization. Specifically, our method outperforms other probabilistic methods by $>40\%$ and achieves results comparable to state-of-the-art deterministic methods. In addition, our method is robust to varying camera numbers, spatial arrangements, and datasets.
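To make the idea of geometric guidance concrete, the following is a minimal sketch of a single correction step that nudges a 3D pose estimate toward lower multi-view reprojection error. It is not the paper's sampler: the abstract only states that guidance is obtained by modeling a conditional guidance distribution, so the plain gradient step, the function names (`project`, `reprojection_error`, `reprojection_grad`, `guided_step`), the `scale` parameter, and the use of 3x4 pinhole camera matrices are all assumptions made for illustration.

```python
import numpy as np

def project(P, X):
    """Project 3D joints X (J, 3) with a 3x4 camera matrix P to 2D points (J, 2)."""
    Xh = np.hstack([X, np.ones((X.shape[0], 1))])  # homogeneous coordinates (J, 4)
    x = Xh @ P.T                                   # (J, 3)
    return x[:, :2] / x[:, 2:3]                    # perspective division

def reprojection_error(Ps, X, kps2d):
    """Sum of squared 2D residuals per view, averaged over views."""
    return np.mean([np.sum((project(P, X) - kp) ** 2) for P, kp in zip(Ps, kps2d)])

def reprojection_grad(Ps, X, kps2d):
    """Analytic gradient of reprojection_error with respect to X, shape (J, 3)."""
    grad = np.zeros_like(X)
    Xh = np.hstack([X, np.ones((X.shape[0], 1))])
    for P, kp in zip(Ps, kps2d):
        x = Xh @ P.T
        u, v, w = x[:, 0], x[:, 1], x[:, 2]
        r = np.stack([u / w, v / w], axis=1) - kp  # 2D residuals (J, 2)
        # Derivatives of u/w and v/w with respect to the 3D point (quotient rule).
        du = (P[0, :3][None] - (u / w)[:, None] * P[2, :3][None]) / w[:, None]
        dv = (P[1, :3][None] - (v / w)[:, None] * P[2, :3][None]) / w[:, None]
        grad += 2.0 * (r[:, :1] * du + r[:, 1:2] * dv)
    return grad / len(Ps)

def guided_step(X, Ps, kps2d, scale=0.1):
    """Hypothetical guidance step: move the pose estimate down the error gradient."""
    return X - scale * reprojection_grad(Ps, X, kps2d)
```

In a diffusion sampler, a correction of this kind would typically be applied to the denoised pose estimate at each sampling step, with `scale` controlling how strongly the 2D observations pull on the sample; how the paper schedules or weights this is not stated in the abstract.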
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Experience] Multimedia Applications, [Content] Vision and Language
Relevance To Conference: This paper focuses on multi-view 3D Human Pose Estimation, which aims to accurately localize sparse joints of the human skeleton in 3D space by analyzing multimedia data such as images and videos. One of our key contributions is the effective and robust fusion of multimedia data from multiple cameras, which can serve as a reference for other multimedia/multimodal processing challenges. The resulting geometric and positional information about the human body can also benefit various multimedia tasks, including action recognition and prediction, immersive telepresence, distance education, and the metaverse. Relevant research has been widely published in recent ACM MM conferences, with over 10 papers presented in 2023, including [1,2,3].
[1] Wang, Tao, et al. "DecenterNet: Bottom-Up Human Pose Estimation Via Decentralized Pose Representation."
[2] Liu, Hanbing, et al. "Posynda: Multi-hypothesis pose synthesis domain adaptation for robust 3d human pose estimation."
[3] Zhou, Kangkang, et al. "Efficient Hierarchical Multi-view Fusion Transformer for 3D Human Pose Estimation."
Submission Number: 3159