Swin-Pose: Swin Transformer Based Human Pose Estimation

Zinan Xiong, Chenxi Wang, Ying Li, Yan Luo, Yu Cao

Published: 2022, Last Modified: 12 Nov 2025. Venue: MIPR 2022. License: CC BY-SA 4.0

Abstract: Convolutional neural networks (CNNs) have been widely used in many computer vision tasks. However, CNNs have a fixed receptive field and lack long-range perception, which is crucial to human pose estimation. The Transformer architecture has recently been adopted for computer vision applications and has proven highly effective. We are interested in exploring its capability in human pose estimation, and thus propose a novel transformer-based model enhanced with a feature pyramid fusion structure. More specifically, we use a pre-trained Swin Transformer to extract features and leverage a feature pyramid structure to extract and fuse feature maps from different stages. Experimental results demonstrate that the proposed transformer-based model can achieve better performance than state-of-the-art CNN-based models.
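The abstract does not spell out how the stage outputs are fused, so the following is only an illustrative sketch, not the authors' implementation. It assumes a standard Swin-T style backbone (patch size 4, embedding width 96, four stages), a 256x192 input as an example size, and an FPN-style top-down fusion by nearest-neighbour upsampling and element-wise addition. The function names are hypothetical, and the toy maps are single-channel, so the channel projection a real pyramid would need is skipped.

```python
# Illustrative sketch only: stage shapes of a standard Swin backbone and a
# toy top-down fusion. All names and sizes here are assumptions.

def swin_stage_shapes(height, width, embed_dim=96, patch_size=4, num_stages=4):
    """Return (channels, h, w) for each stage of a standard Swin backbone."""
    shapes = []
    stride = patch_size
    channels = embed_dim
    for _ in range(num_stages):
        shapes.append((channels, height // stride, width // stride))
        stride *= 2    # patch merging halves the spatial resolution
        channels *= 2  # and doubles the channel width
    return shapes


def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D list of numbers."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def top_down_fuse(maps):
    """Fuse single-channel maps ordered fine to coarse.

    Each map is assumed to be exactly half the size of the previous one.
    """
    fused = maps[-1]
    for finer in reversed(maps[:-1]):
        up = upsample2x(fused)
        fused = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(finer, up)]
    return fused


if __name__ == "__main__":
    for shape in swin_stage_shapes(256, 192):
        print(shape)
    toy = [[[1] * 4 for _ in range(4)], [[10] * 2 for _ in range(2)], [[100]]]
    print(top_down_fuse(toy))
```

Under these assumptions, the coarsest stage carries the widest context at the lowest resolution, and the fusion pushes that context back into the finer maps, which is the general motivation for combining feature maps from different stages.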

External IDs: dblp:conf/mipr/XiongWLLC22