Learning Local-Global Contextual Adaptation for Fully End-to-End Bottom-Up Human Pose Estimation

21 May 2021 (modified: 05 May 2023) · NeurIPS 2021 Submission
Keywords: Human Pose Estimation, Structured Prediction, Deep Learning, Computer Vision
Abstract: This paper presents a method of learning LOcal-GlObal Contextual Adaptation for fully end-to-end and fast bottom-up human Pose estimation, dubbed LOGO-CAP. It is built on the conceptually simple center-offset formulation, which, however, often lacks sufficient accuracy for pose estimation. Revisiting bottom-up human pose estimation through the lens of “Thinking, Fast and Slow” by D. Kahneman, we introduce a “slow keypointer” to remedy the insufficient accuracy of the “fast keypointer”. In learning the “slow keypointer”, the proposed LOGO-CAP lifts the initial “fast” keypoints predicted by the offsets to keypoint expansion maps (KEMs) to counter their uncertainty, in two modules. First, local KEMs (e.g., 11×11) are extracted from a low-dimensional feature map. A proposed convolutional message passing module learns to “re-focus” the local KEMs into keypoint attraction maps (KAMs) by accounting for the structured-output nature of human pose estimation, and is directly supervised by the object keypoint similarity (OKS) loss in training. Second, global KEMs are extracted, with a sufficiently large region of interest (e.g., 97×97), from the keypoint heatmaps computed by a direct map-to-map regression. A local-global contextual adaptation module then convolves the global KEMs using the learned KAMs as kernels. This convolution can be understood as a learnable-offset-guided deformable and dynamic convolution applied in a pose-sensitive way. The proposed method is end-to-end trainable with near real-time inference speed, obtaining state-of-the-art performance on the COCO keypoint benchmark for bottom-up human pose estimation. With the COCO-trained model, our LOGO-CAP also outperforms prior art by a large margin on the challenging OCHuman dataset.
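To make the local-global contextual adaptation step more concrete, below is a minimal PyTorch-style sketch of convolving per-instance, per-keypoint global KEMs with the learned KAMs used as dynamic kernels. All function names, tensor shapes, and the soft-argmax readout are illustrative assumptions, not the authors' implementation.

```python
# Sketch (assumed shapes, not the authors' code): each instance/keypoint has its
# own learned keypoint attraction map (KAM), used as a dynamic convolution kernel
# over the corresponding global keypoint expansion map (KEM).
import torch
import torch.nn.functional as F


def local_global_adaptation(global_kems: torch.Tensor,
                            kams: torch.Tensor) -> torch.Tensor:
    """Convolve per-keypoint global KEMs with per-keypoint KAM kernels.

    Args:
        global_kems: (N, K, S, S) global KEMs cropped around the initial
            "fast" keypoints, e.g. S = 97.
        kams: (N, K, k, k) learned keypoint attraction maps, e.g. k = 11.

    Returns:
        (N, K, S, S) refined keypoint response maps.
    """
    n, num_joints, s, _ = global_kems.shape
    ksize = kams.shape[-1]

    # Fold (instance, keypoint) into the channel dimension and use grouped
    # convolution so that every map is convolved with its own kernel.
    x = global_kems.reshape(1, n * num_joints, s, s)
    weight = kams.reshape(n * num_joints, 1, ksize, ksize)
    out = F.conv2d(x, weight, padding=ksize // 2, groups=n * num_joints)
    return out.reshape(n, num_joints, s, s)


def soft_argmax(maps: torch.Tensor) -> torch.Tensor:
    """Read out sub-pixel keypoint locations (one possible decoding choice)."""
    n, num_joints, h, w = maps.shape
    probs = maps.reshape(n, num_joints, -1).softmax(-1).reshape(n, num_joints, h, w)
    ys = torch.arange(h, dtype=maps.dtype, device=maps.device)
    xs = torch.arange(w, dtype=maps.dtype, device=maps.device)
    y = (probs.sum(dim=-1) * ys).sum(dim=-1)  # expectation over rows
    x = (probs.sum(dim=-2) * xs).sum(dim=-1)  # expectation over columns
    return torch.stack([x, y], dim=-1)        # (N, K, 2) in patch coordinates


if __name__ == "__main__":
    refined = local_global_adaptation(torch.rand(2, 17, 97, 97),
                                      torch.rand(2, 17, 11, 11))
    print(soft_argmax(refined).shape)  # torch.Size([2, 17, 2])
```

The grouped convolution is simply one way to express the dynamic, pose-sensitive kernels: every (instance, keypoint) pair is treated as its own group, so no kernels are shared across people or joints.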
Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.
TL;DR: This paper presents a method of learning LOcal-GlObal Contextual Adaptation for fully end-to-end and fast bottom-up human Pose estimation, obtaining state-of-the-art performance with nearly real-time inference speed.
Supplementary Material: pdf