Aggregation Transformer for Human Pose Estimation

Published: 01 Jan 2022, Last Modified: 04 Nov 2025 · ICPR 2022 · CC BY-SA 4.0
Abstract: Transformers, known for their attention mechanisms, excel at extracting global information, while CNNs, through parameter sharing, are well suited to extracting local information. Both kinds of information matter for human pose estimation. However, existing Transformer-based pose estimation methods cannot extract local image information well on mid-sized datasets (see [1] for details). We therefore propose a novel Transformer framework for human pose estimation, termed ATPose (Aggregation Transformer for Human Pose Estimation): (1) We embed convolution operations into the Transformer decoder to capture local information. The attention module first extracts global information from the feature maps, and convolution layers then operate on those maps to focus on local information, so the Transformer captures global and local information simultaneously. (2) We introduce sparse and multi-scale attention mechanisms into the Transformer. Sparse attention lets each feature pixel interact only with a set of sampled pixels rather than with all pixels, which reduces computation cost, while multi-scale attention computes attention over feature maps at different resolutions to better capture small targets. (3) We add a Keypoint Head module to the decoder. It refines the model's predictions, making the predicted coordinates more accurate, and also guides training to keep the model from deviating. Experimental results show that ATPose achieves state-of-the-art performance and establishes a new baseline among regression-based methods.
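
The abstract gives no implementation details, so the following is only a minimal PyTorch sketch of the decoder idea in contribution (1): global self-attention over flattened feature-map tokens, followed by a convolution over the reshaped 2-D map to re-introduce locality. The class name, dimensions, depthwise-convolution choice, and layer ordering are all assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class AttnConvDecoderBlock(nn.Module):
    """Hypothetical decoder block: attention for global context, conv for local detail."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        # A depthwise 3x3 convolution is one plausible way to add local
        # information after attention (an assumption, not from the paper).
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, h, w):
        # x: (batch, h*w, dim) -- feature map flattened into tokens.
        q = self.norm1(x)
        x = x + self.attn(q, q, q)[0]                 # global interaction
        b, n, d = x.shape
        feat = x.transpose(1, 2).reshape(b, d, h, w)  # back to a 2-D map
        feat = feat + self.conv(feat)                 # local refinement
        return self.norm2(feat.flatten(2).transpose(1, 2))

# Usage: tokens from a 64x48 feature map with 256 channels.
block = AttnConvDecoderBlock()
tokens = torch.randn(2, 64 * 48, 256)
out = block(tokens, 64, 48)
print(out.shape)  # torch.Size([2, 3072, 256])
```

The sparse and multi-scale attention of contribution (2) would replace the dense `nn.MultiheadAttention` here with attention over sampled locations across several feature-map resolutions; that variant is not sketched because the sampling scheme is not specified in the abstract.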