Abstract: Based on disentangled representation learning theory and the cross-modal variational autoencoder (VAE) model, we derive a “Single Input Multiple Output” (SIMO) disentangled model, \(\text{cmSIMO-}\beta\text{-VAE}\). Guided by this derived model, we design a new VAE network, named da-VAE, for the challenging task of 3D hand pose estimation from a single RGB image. The da-VAE network has a multi-head encoder with attention modules. Together with specific supervision signals, the latent space is decomposed into subspaces with explicit semantics, corresponding to the generative factors of hand pose, shape, appearance, and others. The performance of the proposed da-VAE network is evaluated on the RHD and STB datasets. The experimental results show accuracy competitive with state-of-the-art methods.
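As background for readers unfamiliar with the \(\beta\)-VAE objective underlying the derived model, the following is a minimal sketch of the standard \(\beta\)-VAE loss (reconstruction error plus a \(\beta\)-weighted KL divergence); this illustrates the generic objective only, not the authors' da-VAE architecture, and the function name and arguments are illustrative assumptions.

```python
import math

def beta_vae_loss(recon_error, mu, logvar, beta=4.0):
    """Illustrative beta-VAE objective: reconstruction error plus a
    beta-weighted KL term. The approximate posterior is assumed to be a
    diagonal Gaussian N(mu, diag(exp(logvar))); the prior is N(0, I).

    Note: this is a generic sketch of the beta-VAE loss, not the paper's
    cmSIMO-beta-VAE derivation or the da-VAE training objective.
    """
    # Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over latent dims:
    # 0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2)
    kl = 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                   for m, lv in zip(mu, logvar))
    return recon_error + beta * kl
```

With `beta > 1`, the KL term is weighted more heavily than in a plain VAE, which is what encourages the disentangled, factorized latent representations the abstract refers to.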