SwinCAE: Capsule Autoencoder using Shifted Windows for 3D Human Pose Estimation

Xiufeng Liu, Zhongqiu Zhao, Yi Yang, Donghui Hu, Zhao Zhang

Published: 2025, Last Modified: 16 Jan 2026, ICME 2025, CC BY-SA 4.0

Abstract: Estimating 3D human poses from monocular videos is a challenging task, primarily due to self-occlusion. Many existing methods struggle with unseen viewpoints because they rely on large amounts of data rather than improving their ability to generalize across viewpoints. To overcome this limitation, we propose a novel approach that integrates a capsule autoencoder with the shifted-windows model (SwinCAE), which enhances prediction accuracy by capturing the spatial hierarchical relationships between parts and objects. Furthermore, we build a Parallel Double Attention with Shifted Windows module to enhance computational efficiency and modeling capacity. Additionally, we construct a Multi-Attention Collaborative module to capture diverse information, including both coarse and fine details. Through the collaboration of these modules, the model's representation is significantly improved, resulting in more accurate generated poses. Extensive experiments demonstrate that SwinCAE achieves results better than or comparable to state-of-the-art models on the 3D human pose estimation task.

External IDs: dblp:conf/icmcs/LiuZYHZ25a