Efficient Dual-Scale Cross-Attention Network for Human Pose Estimation

Dong Wang, Yiming Tang, Youcheng Cai, Xiaoping Liu

Published: 01 Jan 2025, Last Modified: 12 Nov 2025 · IEEE Transactions on Instrumentation and Measurement · CC BY-SA 4.0
Abstract: Human pose estimation has witnessed substantial advancements with the Vision Transformer (ViT), which localizes keypoints accurately by leveraging its ability to extract global features. While increasing the depth or width of the self-attention mechanism can improve the accuracy of joint measurement, it inevitably results in higher computational complexity. This article presents an efficient dual-scale cross-attention network (EDCNet) for precise human keypoint measurement. EDCNet incorporates a space split module (SSM) and a dual-scale cross-attention module (DCM), striking a more favorable balance between efficiency and accuracy than the conventional stacking of vanilla ViT blocks. At the core of the DCM lies a novel structure termed the channel-spatial ViT (CSViT). By leveraging the cross-attention mechanism formed by dual-scale CSViT, the DCM effectively captures both local and global spatial dependencies. Specifically, CSViT employs a spatial activation unit (SAU) to amalgamate independent spatial information from the query, key, and value, effectively integrating both short- and long-range dependencies. Concurrently, we introduce a channel activation unit (CAU) within CSViT to enhance channel awareness through successive convolutions. Moreover, the SSM provides access to multispatial features through cost-effective spatial splitting and, in conjunction with the DCM, enhances the overall performance of EDCNet. Extensive experiments on the COCO and MPII benchmarks demonstrate the effectiveness of the proposed EDCNet, which obtains 74.6 and 91.9 AP on the COCO val and MPII val sets, respectively, achieving superior performance with fewer parameters and lower computational cost than state-of-the-art methods.
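The paper's implementation is not reproduced on this page. Purely as an illustration of the dual-scale cross-attention idea summarized in the abstract, the following minimal PyTorch sketch shows tokens from a fine spatial scale attending to tokens from a coarse scale and vice versa; all names (DualScaleCrossAttention, fine, coarse, dim) are hypothetical and are not taken from the authors' code, which also includes the SAU, CAU, and SSM components not modeled here.

```python
import torch
import torch.nn as nn

class DualScaleCrossAttention(nn.Module):
    """Hypothetical sketch of dual-scale cross-attention: each branch's
    queries attend to the other branch's keys/values, mixing local
    (fine-scale) and global (coarse-scale) context. Illustrative only;
    not the authors' EDCNet/DCM implementation."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # One attention block per direction: fine->coarse and coarse->fine.
        self.fine_from_coarse = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.coarse_from_fine = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor):
        # fine:   (B, N_f, dim) tokens from the high-resolution branch
        # coarse: (B, N_c, dim) tokens from the low-resolution branch
        fine_out, _ = self.fine_from_coarse(query=fine, key=coarse, value=coarse)
        coarse_out, _ = self.coarse_from_fine(query=coarse, key=fine, value=fine)
        # Residual connections preserve each branch's own information.
        return fine + fine_out, coarse + coarse_out

if __name__ == "__main__":
    block = DualScaleCrossAttention(dim=64, num_heads=4)
    fine = torch.randn(2, 256, 64)   # e.g., a 16x16 feature map, flattened
    coarse = torch.randn(2, 64, 64)  # e.g., an 8x8 feature map, flattened
    f, c = block(fine, coarse)
    print(f.shape, c.shape)  # torch.Size([2, 256, 64]) torch.Size([2, 64, 64])
```

Cross-attending across scales in both directions is one plausible way to obtain the local-plus-global coverage the abstract attributes to the DCM, at lower cost than stacking full-resolution self-attention layers.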