HDTPose: A Hierarchical Decoding Transformer for End-to-End Single-Stage Multi-Person Pose Estimation

Wei Zhang, Bin Xue, Qi Li, Zhenan Sun

Published: 2025, Last Modified: 26 May 2026 | IJCNN 2025 | CC BY-SA 4.0

Abstract: Keypoint estimation performance varies significantly across keypoint types. Compared to prominent joints such as the head and shoulders, smaller and more flexible limb joints are harder to identify and localize. Current single-stage methods typically treat all body joints uniformly, overlooking the inherent differences and structural relationships among keypoints. To address this limitation, we propose HDTPose, a fully end-to-end multi-person pose estimation framework based on a hierarchical decoding transformer. HDTPose formulates multi-person pose estimation as a hierarchical set prediction problem and employs a hierarchical decoder to progressively decode joints in an end-to-end manner. The decoding process consists of two stages: an instance-aware stage and a joint-aware stage, which explicitly model instance-wise and joint-wise relationships, respectively. Additionally, we introduce a structure-guided joint attention mechanism that leverages kinematic relationships to refine pose predictions. Extensive experiments on the COCO and MPII benchmarks demonstrate that HDTPose outperforms existing state-of-the-art single-stage methods.
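
The abstract does not say how the structure-guided joint attention is computed. As one possible reading, and not the authors' implementation, the sketch below biases plain self-attention over joint features by hop distance along the standard COCO 17-keypoint skeleton, so that kinematically close joints attend to each other more strongly. The function names, the penalty weight `alpha`, and the choice of an additive distance bias are all assumptions made for illustration.

```python
from collections import deque

import numpy as np

# Standard COCO 17-keypoint skeleton, 0-indexed
# (0 nose, 5/6 shoulders, 7/8 elbows, 9/10 wrists, 11/12 hips, ...).
COCO_EDGES = [
    (15, 13), (13, 11), (16, 14), (14, 12), (11, 12), (5, 11), (6, 12),
    (5, 6), (5, 7), (6, 8), (7, 9), (8, 10), (1, 2), (0, 1), (0, 2),
    (1, 3), (2, 4), (3, 5), (4, 6),
]
NUM_JOINTS = 17


def hop_distances(num_joints, edges):
    """All-pairs hop distance on the skeleton graph via BFS from each joint."""
    adj = [[] for _ in range(num_joints)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    dist = np.full((num_joints, num_joints), np.inf)
    for src in range(num_joints):
        dist[src, src] = 0.0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if np.isinf(dist[src, v]):
                    dist[src, v] = dist[src, u] + 1.0
                    queue.append(v)
    return dist


def structure_biased_attention(joint_feats, dist, alpha=1.0):
    """Self-attention over joints with an additive penalty of alpha * hop distance.

    Hypothetical stand-in for a structure-guided attention layer:
    no learned projections, single head, one person instance.
    """
    d = joint_feats.shape[-1]
    scores = joint_feats @ joint_feats.T / np.sqrt(d)
    scores = scores - alpha * dist                 # penalize distant joints
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ joint_feats, weights


dist = hop_distances(NUM_JOINTS, COCO_EDGES)
feats = np.zeros((NUM_JOINTS, 8))  # dummy features: weights depend on structure only
refined, weights = structure_biased_attention(feats, dist, alpha=1.0)
```

With all-zero features the attention weights depend only on the skeleton, so the left wrist (index 9) weights the left elbow (index 7) above the left ankle (index 15). A learned model would more likely combine such a prior with content-based scores and trainable projections.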

External IDs: dblp:conf/ijcnn/ZhangXLS25