Abstract: Convolutional neural networks (CNNs) have been widely utilized in many computer vision tasks. However, CNNs have a fixed reception field and lack the ability of long-range perception, which is crucial to human pose estimation. Transformer architecture has been adopted to computer vision applications recently and is proven to be a highly effective architecture. We are interested in exploring its capability in human pose estimation, and thus propose a novel model based on transformer, enhanced with a feature pyramid fusion structure. More specifically, we use pre-trained Swin Transformer to extract features, and leverage a feature pyramid structure to extract and fuse feature maps from different stages. The experiment results of our study have demonstrated that the proposed transformer-based model can achieve better performance compared to the state-of-the-art CNN-based models.
External IDs:dblp:conf/mipr/XiongWLLC22
Loading