Attend to Who You Are: Supervising Self-Attention for Keypoint Detection and Instance-Aware Association
Keywords: human pose estimation, bottom-up, self-attention, transformer, instance segmentation
Abstract: Bottom-up multi-person pose estimation models need to detect keypoints and learn associative information between keypoints.
We argue that these problems can be entirely solved by the Transformer model. Specifically, self-attention in the Transformer measures pairwise dependencies between locations, which can naturally provide the association cues needed for keypoint grouping.
However, naive attention patterns are not explicitly supervised, so there is no guarantee that keypoints will attend to the instances they belong to.
To address this, we propose a novel approach to multi-person keypoint detection and instance association that uses instance masks to supervise self-attention. By making the self-attention instance-aware, we can assign detected keypoints to the correct human instances based on pairwise attention scores, without the pre-defined offset vector fields or embeddings used by CNN-based bottom-up models. An additional benefit of our method is that instance segmentation results for any number of people can be obtained directly from the supervised attention matrix, thereby simplifying the pixel assignment pipeline.
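The core idea above can be illustrated with a toy sketch. All names, shapes, and the loss choice below are illustrative assumptions, not the paper's actual implementation: it builds a mask-derived attention target (location i should attend to location j iff they lie on the same instance), supervises a toy attention matrix with a binary cross-entropy, and groups a detected keypoint by summing its attention over each instance's mask.

```python
import numpy as np

# Hedged sketch, assuming a flattened feature map of L locations and an
# attention matrix A of shape (L, L), where A[i, j] is how strongly
# location i attends to location j. None of this is the authors' code.

rng = np.random.default_rng(0)
L = 6  # tiny toy feature map, flattened

# Ground-truth instance masks: instance_id[j] = which person occupies
# location j (-1 = background). Two toy instances: {0,1,2} and {3,4}.
instance_id = np.array([0, 0, 0, 1, 1, -1])

# Target attention pattern: a location should attend to its own instance's
# mask, i.e. target[i, j] = 1 iff i and j belong to the same (non-background)
# instance. This is the "instance-aware" supervision signal.
target = ((instance_id[:, None] == instance_id[None, :])
          & (instance_id[:, None] >= 0)).astype(float)

# A toy predicted attention matrix (row-wise softmax, rows sum to 1).
logits = rng.normal(size=(L, L))
A = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Supervision: here a per-element binary cross-entropy between the predicted
# attention and the mask-derived target (an assumed, simplified loss).
eps = 1e-8
bce = -(target * np.log(A + eps)
        + (1 - target) * np.log(1 - A + eps)).mean()

# Instance association at inference: assign each detected keypoint to the
# instance whose mask receives the highest total attention from it.
keypoint_locs = [1, 4]  # toy detected keypoint positions
for k in keypoint_locs:
    scores = [A[k, instance_id == i].sum() for i in range(2)]
    assigned = int(np.argmax(scores))
```

With trained, instance-aware attention, the row `A[k]` would concentrate on the mask of keypoint k's person, so the same row also yields a person segmentation mask for free.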
The experiments on the COCO multi-person keypoint detection challenge and person instance segmentation task demonstrate the effectiveness and simplicity of the proposed method.
One-sentence Summary: We propose a novel approach of multi-person keypoint detection and instance association using instance masks to supervise self-attention