KTPose: Keypoint-Based Tokens in Vision Transformer for Human Pose Estimation

Published: 01 Jan 2023, Last Modified: 13 Nov 2024, SMC 2023, CC BY-SA 4.0
Abstract: Transformers have made remarkable progress on human pose estimation in recent years; however, vision tokens sit at fixed positions, a property ill-suited to unknown human deformation. In this paper, we propose KTPose, a Vision Transformer with novel keypoint-based tokens for human pose estimation, comprising an instance-aware keypoint head and a transformer-based keypoint refinement module. To address limb deformation, the instance-aware keypoint head dynamically captures discriminative features based on the coarsely localized keypoints. Further, we propose multi-granularity vision tokens, in which each keypoint is explicitly embedded as a token so that the transformer simultaneously learns spatial dependencies and constraint relationships for human pose estimation. Extensive experiments on two benchmark datasets demonstrate that KTPose outperforms state-of-the-art methods, achieving 76.6 AP (↑1.06%) and 75.7 AP (↑0.93%) on the COCO validation and test-dev sets, respectively, with a smaller computational footprint than current mainstream transformer-based methods. Code is publicly available at https://github.com/WINGS-999/KTPose.
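The core idea sketched in the abstract is that each coarsely localized keypoint becomes its own token, concatenated with the ordinary patch tokens so the transformer attends over both granularities. The following is a minimal NumPy sketch of that token construction only; the function names, shapes, and feature-sampling scheme are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def build_multi_granularity_tokens(patch_tokens, keypoint_coords, feature_map):
    """Illustrative sketch (not the authors' code): embed each coarse
    keypoint as a token by sampling the backbone feature map at the
    keypoint location, then concatenate with the patch tokens so a
    transformer can model dependencies across both token granularities.

    patch_tokens:    (N_patches, C) flattened patch embeddings
    keypoint_coords: (K, 2) integer (y, x) coarse keypoint locations
    feature_map:     (H, W, C) backbone feature map
    """
    # One token per keypoint, sampled at its (y, x) location -> (K, C)
    kp_tokens = np.stack([feature_map[y, x] for y, x in keypoint_coords])
    # Multi-granularity token sequence -> (N_patches + K, C)
    return np.concatenate([patch_tokens, kp_tokens], axis=0)

# Toy example: 16x16 feature map, 64 channels, 4 coarse keypoints
C = 64
feat = np.random.randn(16, 16, C)
patches = feat.reshape(-1, C)                       # 256 patch tokens
kps = np.array([[2, 3], [7, 7], [10, 4], [15, 15]])  # hypothetical keypoints
tokens = build_multi_granularity_tokens(patches, kps, feat)
print(tokens.shape)  # (260, 64)
```

In the full method these keypoint tokens would be refined by transformer layers and the refined positions fed back for the final prediction; the sketch only shows how the token sequence is assembled.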