Abstract: Human pose estimation is a challenging research task in the computer vision community due to the semantic ambiguity problem caused by inevitable occlusions, varying body shapes, and complex articulations. Although deep learning-based methods have significantly improved the performance of this task, existing feature upsampling operations, e.g., bilinear interpolation and transposed convolution, within current convolutional neural network and Transformer frameworks suffer from several limitations, including the inability to adapt to specific tasks and the loss of fine-grained semantic details. In this work, we propose a simple yet effective two-step stable feature upsampling (SIU) strategy that addresses these limitations by leveraging a learnable and efficient upsampling operation. Specifically, we first apply periodic shuffling to increase the resolution of the feature maps; we then use convolution layers to adjust the number of feature channels to match that of the input feature maps. The proposed SIU enables the entire network to adapt to the specific feature requirements of the human pose estimation task, making it more effective at preserving spatial information. Quantitatively, extensive experimental results on the challenging COCO-WholeBody dataset validate that our approach outperforms state-of-the-art methods in both accuracy and efficiency, and possesses strong transferability, making it applicable to a wide range of baselines. Moreover, the qualitative results validate that SIU can effectively eliminate the semantic ambiguity problem in challenging pose scenarios, such as occlusion and overlapping bodies. The code and weights have been released at: SIU.
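The abstract describes a two-step upsampling scheme: periodic shuffling to raise spatial resolution, followed by a convolution that restores the channel dimension. Below is a minimal PyTorch sketch of that idea, assuming the periodic shuffling corresponds to a pixel-shuffle rearrangement and a 3x3 convolution is used for channel adjustment; the module name, kernel size, and scale factor are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class TwoStepUpsample(nn.Module):
    """Hypothetical sketch of a two-step upsampling block:
    (1) periodic shuffling (pixel shuffle) to increase resolution,
    (2) a convolution to match the channel count of the input features."""

    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        assert channels % (scale ** 2) == 0, "channels must be divisible by scale^2"
        # Step 1: rearrange channel blocks into spatial positions,
        # increasing resolution by `scale` and reducing channels by scale^2.
        self.shuffle = nn.PixelShuffle(scale)
        # Step 2: restore the channel dimension to match the input feature maps.
        self.adjust = nn.Conv2d(channels // (scale ** 2), channels,
                                kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adjust(self.shuffle(x))


if __name__ == "__main__":
    x = torch.randn(1, 64, 16, 16)           # (N, C, H, W) feature map
    up = TwoStepUpsample(channels=64, scale=2)
    y = up(x)
    print(y.shape)                            # torch.Size([1, 64, 32, 32])
```

Because both steps are parameterized by learnable convolution weights, the upsampling can adapt to the task during training, unlike fixed bilinear interpolation.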
External IDs: doi:10.1109/tcsvt.2025.3585888