DTPose: Learning Disentangled Token Representation for Effective Human Pose Estimation

Published: 2024 (ICIP 2024). Last Modified: 05 Nov 2025. License: CC BY-SA 4.0
Abstract: Exploring rich visual cues and spatial geometric constraints to locate keypoints is essential for human pose estimation. Existing Transformer-based methods offer unique advantages via token representation, where each keypoint is explicitly embedded as a token to learn visual appearance cues and geometric relationships simultaneously from images. However, it is difficult to learn a powerful pose representation via the self-attention mechanism due to latent interference, e.g., blurring and self-occlusion. To alleviate this challenge, we present a novel framework that Disentangles the hybrid Token representation to explore more effective visual and keypoint information for Pose estimation (termed DTPose). In detail, DTPose contains two key modules. First, the Disentangled Token Representation module explores visual cues and geometric constraints sequentially, which alleviates noise interference and enables geometry and appearance cues to be exploited more fully. Second, the Hierarchical Spatial Decoding head preserves the 2D geometric structure of keypoints as much as possible. Extensive experiments on the COCO dataset demonstrate significant performance gains for DTPose, which achieves 76.5 AP (+0.7) and 75.7 AP (+0.6) over TokenPose-L on the COCO validation and test-dev sets, respectively.
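To make the keypoint-as-token idea concrete, the following is a minimal sketch of how a TokenPose-style model mixes learnable keypoint tokens with visual patch tokens through self-attention. All shapes, names, and the single-head identity-projection attention are illustrative assumptions, not the authors' actual DTPose implementation.

```python
import numpy as np

# Hypothetical sizes: 16 image patches, 17 COCO keypoints, 32-dim embeddings.
rng = np.random.default_rng(0)
num_patches, num_keypoints, dim = 16, 17, 32

visual_tokens = rng.standard_normal((num_patches, dim))      # patch embeddings
keypoint_tokens = rng.standard_normal((num_keypoints, dim))  # learnable keypoint queries

# Hybrid token sequence: visual tokens followed by keypoint tokens.
tokens = np.concatenate([visual_tokens, keypoint_tokens], axis=0)

def self_attention(x):
    """Single-head scaled dot-product self-attention (identity Q/K/V for brevity)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

out = self_attention(tokens)
# Keypoint tokens now aggregate both appearance cues (from visual tokens)
# and geometric relationships (from the other keypoint tokens).
refined_keypoints = out[num_patches:]
print(refined_keypoints.shape)  # (17, 32)
```

In this joint formulation every keypoint token attends to visual and keypoint tokens in a single pass, which is precisely where the abstract locates the interference problem; DTPose's disentangled module instead processes visual cues and geometric constraints sequentially.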