Swin-CAE: Capsule Autoencoder using Shifted Windows for 3D Human Pose Estimation
Abstract: 3D human pose estimation from monocular videos still performs poorly on unseen viewpoints. Traditional deep
learning methods typically rely on view-invariant operations such as convolution and max pooling. However, these operations do not necessarily
improve viewpoint generalization; instead, such methods rely on more data. To
tackle this limitation, we propose a Capsule Autoencoder with
Shifted Windows, dubbed Swin-CAE. It preserves
the spatial hierarchy of each joint and the geometrical structure in
the feature space. Our model achieves viewpoint
equivariance through the following components: (1) a
capsule autoencoder that parses the part-object spatial hierarchy;
(2) self-capsule communication and
cross-capsule feature assembly to model the 3D pose; (3) a Serial-Parallel Double Attention module that uses shifted windows to
capture part-object geometry and fine structures in the scene. Extensive experiments show that our Swin-CAE achieves comparable
results to state-of-the-art models on the challenging viewpoint
transfer task on two commonly used datasets: Human3.6M and
MPI-INF-3DHP. In particular, our method outperforms the previous
model by a large margin of 18.4% on MPI-INF-3DHP.