Eliminating Semantic Ambiguity in Human Pose Estimation via Stable Feature Upsampling

Shu Jiang, Dong Zhang, Rui Yan, Xiangbo Shu, Pingcheng Dong, Long Chen, Xiaoyu Du

Published: 01 Jan 2025 · Last Modified: 15 Jan 2026 · IEEE Transactions on Circuits and Systems for Video Technology · CC BY-SA 4.0
Abstract: Human pose estimation is a challenging research task in the computer vision community due to the semantic ambiguity caused by inevitable occlusions, varying body shapes, and complex articulations. Although deep learning-based methods have significantly improved performance on this task, the feature upsampling operations used in current convolutional neural network and Transformer frameworks, e.g., bilinear interpolation and transposed convolution, suffer from several limitations, including the inability to adapt to the specific task and the loss of fine-grained semantic details. In this work, we propose a simple yet effective two-step stable feature upsampling (SIU) strategy that addresses these limitations through a learnable and efficient upsampling operation. Specifically, we first apply periodic shuffling to increase the resolution of the feature maps. Second, we use convolution layers to adjust the number of feature channels to match that of the input feature maps. The proposed SIU enables the entire network to adapt to the specific feature requirements of human pose estimation, making it more effective at preserving spatial information. Quantitatively, extensive experimental results on the challenging COCO-WholeBody dataset show that our approach outperforms state-of-the-art methods in both accuracy and efficiency, and possesses strong transferability across a wide range of baselines. Moreover, the qualitative results show that SIU effectively eliminates semantic ambiguity in challenging pose scenarios, such as occlusion and overlapping bodies. The code and weights have been released at: SIU.
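The two-step operation described in the abstract, periodic shuffling followed by a channel-adjusting convolution, can be sketched as follows. This is a minimal illustration assuming a PyTorch implementation; the module name SIUUpsample and the kernel sizes are hypothetical and not taken from the released code.

```python
import torch
import torch.nn as nn


class SIUUpsample(nn.Module):
    """Illustrative two-step upsampling: periodic shuffling raises spatial
    resolution, then a convolution restores the input channel count."""

    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        assert channels % (scale * scale) == 0, "channels must be divisible by scale^2"
        # Step 1: periodic shuffling rearranges channels into space,
        # turning (B, C, H, W) into (B, C / scale^2, scale*H, scale*W).
        self.shuffle = nn.PixelShuffle(scale)
        # Step 2: a learnable convolution adjusts the channel count back to C,
        # so the output matches the channel dimension of the input feature map.
        self.adjust = nn.Conv2d(channels // (scale * scale), channels,
                                kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adjust(self.shuffle(x))


# Usage: upsample a 64-channel feature map from 32x32 to 64x64.
feat = torch.randn(1, 64, 32, 32)
up = SIUUpsample(channels=64, scale=2)
print(up(feat).shape)  # torch.Size([1, 64, 64, 64])
```

Because both steps are learnable or parameter-free rearrangements rather than fixed interpolation kernels, the upsampling path can adapt to the task during training, which is the property the abstract attributes to SIU.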