Compositional Token Modeling for Occlusion-Robust Human Pose Estimation

ICLR 2026 Conference Submission 18102 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Keypoint Detection, Human Pose Estimation, Multimodal Fusion, Keypoint Completion
Abstract: Current pose estimation systems show critical weaknesses under occlusion, where the simultaneous deterioration of visual appearance cues and biomechanical topology constraints leads to compounded errors. This chain of failures originates in models' inability to separate structural-coherence preservation from feature-level corruption recovery. To address this problem, we propose CM-PCT, a Cross-Modal Pose estimation model with Compositional Tokens. Our approach improves occlusion robustness through four key technical innovations: (1) a keypoint coordinate completion mechanism for occluded joints, which provides more complete input to the model; (2) position vector embedding to enhance spatial representation, supplying the contextual information that joint coordinate vectors lack; (3) SE attention for cross-modal feature fusion, which reduces noise interference between features through channel-wise weight recalibration; and (4) a group-based loss function for differential optimization of body parts, which improves estimation accuracy in occluded regions through targeted supervision. Compared to coordinate-driven pose estimators, CM-PCT advances occlusion robustness through its probabilistic completion mechanism and anatomical embedding paradigm, significantly reducing joint ambiguity while maintaining biomechanical consistency under extreme occlusion. Extensive experiments on the COCO and OCHuman datasets show that our method achieves state-of-the-art performance across diverse scenarios, from standard benchmarks to occlusion-heavy environments.
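Of the four components named in the abstract, the SE-attention fusion step (3) is the easiest to make concrete. The sketch below shows channel-wise weight recalibration over concatenated cross-modal features in the spirit of squeeze-and-excitation; the class name `SEFusion`, the `reduction` ratio, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SEFusion(nn.Module):
    """Minimal squeeze-and-excitation-style gate for fusing two modalities.

    Hypothetical sketch: the paper's actual fusion module may differ in
    layout, gating placement, and feature shapes.
    """

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Bottleneck MLP that produces one sigmoid weight per channel.
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, visual_feat: torch.Tensor, coord_feat: torch.Tensor) -> torch.Tensor:
        # Concatenate visual and coordinate features along channels: (B, C).
        fused = torch.cat([visual_feat, coord_feat], dim=-1)
        # Recalibrate: learned per-channel weights down-weight noisy
        # channels from either modality before further processing.
        weights = self.gate(fused)
        return fused * weights


# Usage with toy shapes: two 256-d modality features fuse into 512 channels.
fusion = SEFusion(channels=512)
out = fusion(torch.randn(2, 256), torch.randn(2, 256))  # -> (2, 512)
```

The design intuition matches the abstract's claim: because the gate is computed from the concatenated features, each modality can suppress the other's corrupted channels (e.g., appearance features under occlusion) via channel-wise reweighting rather than hard masking.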
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 18102