SwinCAE: Capsule Autoencoder using Shifted Windows for 3D Human Pose Estimation

Published: 2025, Last Modified: 16 Jan 2026, Venue: ICME 2025, License: CC BY-SA 4.0
Abstract: Estimating 3D human poses from monocular videos is a challenging task, primarily due to self-occlusion. Many existing methods struggle with unseen viewpoints because they rely on large amounts of data rather than improving their ability to generalize across viewpoints. To overcome this limitation, we propose a novel approach that integrates a capsule autoencoder with the shifted-windows model (SwinCAE), which enhances prediction accuracy by effectively capturing the spatial hierarchical relationships between parts and objects. Furthermore, we build a Parallel Double Attention with Shifted Windows module to improve computational efficiency and modeling capacity. Additionally, we construct a Multi-Attention Collaborative module to capture diverse information, including both coarse and fine details. Through the collaboration of these modules, the model's representation is significantly improved, resulting in more accurate generated poses. Extensive experiments demonstrate that SwinCAE achieves results better than or comparable to those of state-of-the-art models on the 3D human pose estimation task.