PCDPose: enhancing the lightweight 2D human pose estimation model with pose-enhancing attention and context broadcasting

Published: 01 Jan 2025, Last Modified: 30 Apr 2025 · Pattern Anal. Appl. 2025 · CC BY-SA 4.0
Abstract: 2D human pose estimation is an important domain in computer vision. In recent years, lightweight 2D human pose estimation (2DTLHPE) models based on the vision transformer (ViT) have attracted extensive attention due to their fewer parameters and lower computational requirements. However, these models still struggle with cluttered and occluded backgrounds, which lead to errors in locating keypoints. Therefore, this paper proposes the pose-enhanced contextual distillation pose estimation model (PCDPose) to alleviate these challenges. First, PCDPose introduces the pose-enhancing attention (PEA) module, which highlights foreground information in the feature map and thereby mitigates the influence of cluttered backgrounds. Second, PCDPose introduces the context broadcasting (CB) module, which builds long-range dependencies between keypoints in occluded regions and their neighboring keypoints by broadcasting context to each vision token (VT), mitigating the influence of occlusion. Experimental results show that PCDPose achieves 73.5% average precision (AP) on the COCO2017 dataset, a 1% improvement over the state-of-the-art (SOTA) model. On the CrowdPose dataset, PCDPose achieves 71.3% AP, a 5.9% improvement over the SOTA model.
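To make the two mechanisms concrete, below is a minimal NumPy sketch of the ideas the abstract describes. Both functions, their names, and the mixing weight `alpha` are illustrative assumptions, not the paper's actual implementation: PEA is sketched as a spatial gate that reweights the feature map toward salient locations, and CB as mixing every vision token with the global token mean.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pose_enhancing_attention(feat):
    # feat: (C, H, W). Pool across channels to one spatial map, squash it to
    # (0, 1), and reweight the features so salient (foreground) locations are
    # emphasized over cluttered background. A sketch of the PEA idea only.
    attn = sigmoid(feat.mean(axis=0, keepdims=True))  # (1, H, W)
    return feat * attn

def context_broadcasting(tokens, alpha=0.5):
    # tokens: (N, D) vision tokens. Broadcast the global context (the token
    # mean) back into every token, giving occluded regions access to
    # information from all other tokens. alpha is a hypothetical mixing weight.
    context = tokens.mean(axis=0, keepdims=True)      # (1, D)
    return (1 - alpha) * tokens + alpha * context
```

Both operations preserve the input shape, so in this sketch they could be dropped into a ViT-style block without changing downstream dimensions; the actual placement and formulation in PCDPose may differ.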