Rethinking the Sparse End-to-End Multiperson Pose Estimation

Xixia Xu, Qi Zou, Jiamao Li

Published: 26 Dec 2024, Last Modified: 13 Feb 2025, IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS, CC0 1.0

Abstract: Current methods of multiperson pose estimation (MPPE) typically treat human detection and joint association separately. They introduce complex hand-crafted post-processes such as RoI cropping, NMS, and grouping, or rely on dense representations to preserve spatial features. In this article, we take a deeper look at this task and propose a simpler and effective framework, termed SparsePose, which directly predicts multiperson joint coordinates from the full image without any post-processes or dense representations. In SparsePose, full-body instances are decoupled by exploring spatial-aware feature learning (SFL) without box and classification supervision. To improve the quality of the instance map, an instance contrastive constraint (ICC) and a center correction (CC) strategy are proposed to make the instance-wise spatial features more discriminative. Importantly, we propose a visibility-guided weighting mechanism that enables the model to be confident in its predictions for visible joints and insensitive to occlusions or partial bodies. Overall, SparsePose is conceptually simpler and performs favorably against existing counterparts on three benchmarks in terms of both accuracy and efficiency.
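
The abstract does not spell out the form of the visibility-guided weighting. As a rough illustration of the general idea only, the sketch below weights each joint's coordinate error by a visibility score in [0, 1], so occluded or missing joints contribute little to the loss. The function name `visibility_weighted_l1`, the L1 distance, and the normalization by total visibility are all assumptions made for this example, not the authors' formulation.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def visibility_weighted_l1(
    pred: Sequence[Point],
    target: Sequence[Point],
    visibility: Sequence[float],
    eps: float = 1e-8,
) -> float:
    """Hypothetical loss: L1 joint error averaged with visibility weights.

    Each joint's |dx| + |dy| error is scaled by its visibility in [0, 1],
    then the sum is divided by the total visibility (plus eps to avoid
    division by zero when no joint is visible).
    """
    if not (len(pred) == len(target) == len(visibility)):
        raise ValueError("pred, target, and visibility must have equal length")

    weighted_sum = 0.0
    weight_total = 0.0
    for (px, py), (tx, ty), v in zip(pred, target, visibility):
        weighted_sum += v * (abs(px - tx) + abs(py - ty))
        weight_total += v
    return weighted_sum / (weight_total + eps)


# Two joints: the second has a large error but is marked fully occluded.
pred = [(0.0, 0.0), (10.0, 10.0)]
target = [(1.0, 1.0), (0.0, 0.0)]
loss_occluded = visibility_weighted_l1(pred, target, [1.0, 0.0])
loss_visible = visibility_weighted_l1(pred, target, [1.0, 1.0])
print(loss_occluded, loss_visible)
```

In this toy case the occluded joint's error is ignored entirely, whereas marking both joints visible lets the large error dominate. A soft visibility score between 0 and 1 would interpolate between these two behaviors.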