Abstract: Accurate human pose estimation from low-resolution images remains challenging because of quantization errors and occlusion. In this paper, we introduce HigherPose, a bottom-up approach that uses a structure-aware transformer to improve joint localization accuracy at low resolutions. By integrating the transformer with a multi-scale aggregation module, our approach effectively captures long-range dependencies and high-resolution representations. Furthermore, an associative embedding (AE) module is employed for joint grouping, and a deconvolution (Deconv) module is used to extract high-resolution feature maps. Extensive experiments on the COCO2017 and CrowdPose datasets demonstrate that our method significantly outperforms state-of-the-art bottom-up methods, especially on low-resolution inputs. On the COCO2017 test-dev dataset, our method achieves 43.4 AP at \(256\times 256\) input resolution, surpassing HigherHRNet by 2.7 AP. On the CrowdPose test dataset, we achieve 49.1 AP, an improvement of 10.6 AP over HigherHRNet. These results highlight the effectiveness of our structure-aware transformer in mitigating quantization errors and improving joint localization accuracy in low-resolution human pose estimation. Code and models are available at https://github.com/jcong0226/HigherPose.
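To make the quantization error mentioned in the abstract concrete, the following is a minimal sketch, not taken from the HigherPose code. It assumes a heatmap-based estimator that predicts joints on a grid coarser than the input by a hypothetical `stride` of 4 and decodes them by picking the nearest grid cell; the function names are illustrative only. Under those assumptions, the round trip can lose up to half a stride in each axis, and that fixed pixel error covers a larger share of a person as the input resolution shrinks.

```python
import math


def quantization_error(x, stride):
    """Round-trip a 1-D input coordinate through a heatmap grid.

    `stride` is an assumed downsampling factor between the input image
    and the heatmap; it is not a value stated in the abstract.
    """
    cell = math.floor(x / stride + 0.5)  # nearest heatmap cell index
    x_decoded = cell * stride            # map the cell back to input pixels
    return abs(x - x_decoded)


def worst_case_error(size, stride, steps_per_pixel=10):
    """Largest round-trip error over coordinates sampled across [0, size)."""
    samples = (i / steps_per_pixel for i in range(size * steps_per_pixel))
    return max(quantization_error(x, stride) for x in samples)


if __name__ == "__main__":
    for size in (256, 512):
        err = worst_case_error(size, stride=4)
        # The absolute error is the same, but its share of the image differs.
        print(f"input {size}px: worst error {err:.1f}px "
              f"({100 * err / size:.2f}% of image width)")
```

The worst case is half the stride regardless of input size, so halving the input resolution doubles the error relative to the image.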
External IDs:dblp:journals/vc/LiangLLLG26