Swin-CAE: Capsule Autoencoder using Shifted Windows for 3D Human Pose Estimation

17 May 2023 | OpenReview Archive Direct Upload | Readers: Everyone
Abstract: 3D human pose estimation from monocular videos still performs poorly on unseen viewpoints. Traditional deep learning methods typically rely on view-invariant operations such as convolution and max pooling. However, these operations do not necessarily improve viewpoint generalization; instead, such methods rely on more training data. To address this limitation, we propose a Capsule Autoencoder using Shifted Windows, dubbed Swin-CAE. It preserves the spatial hierarchy of each joint and the geometric structure in the feature space. Our model achieves viewpoint equivariance through three components: (1) a capsule autoencoder that parses the hierarchical spatial relationships between parts and objects; (2) self-capsule communication and cross-capsule feature assembly to model the 3D pose; (3) a Serial-Parallel Double Attention module that uses shifted windows to capture part-object geometry and fine structures in the scene. Extensive experiments show that Swin-CAE achieves results comparable to state-of-the-art models on the challenging viewpoint transfer task on two commonly used datasets: Human3.6M and MPI-INF-3DHP. In particular, our method outperforms the previous model by a large margin of 18.4% on MPI-INF-3DHP.