- Original Pdf: pdf
- Abstract: Current bottom-up approaches for 2D multi-person pose estimation (MPPE) detect joints collectively without distinguishing between individuals. Associating the joints into individual poses is done independently of the learning algorithm, therefore requires formulating a separate problem in a post-processing step, which relies on relaxations or sophisticated heuristics. We propose a differentiable learning-based model that performs part detection and association jointly, thereby eliminating the need for further post-processing. The approach introduces a recurrent neural network (RNN), which takes dense low-level features as input and predicts the heatmaps of a single person's joints in each iteration, then refines them in a feedback loop. In addition, the network learns a stopping criterion in order to halt once it has identified all individuals in an image, allowing it to output any number of poses. Furthermore, we introduce an efficient implementation that allows training on memory-constrained machines. The approach is evaluated on the challenging COCO and OCHuman datasets and substantially outperforms the baseline. On OCHuman, which contains severe occlusions, we achieve state-of-the-art results even compared to top-down approaches. Our results demonstrate the advantage of a learning-based detection and association framework, and the advantage of bottom-up approaches over top-down approaches in complex scenarios.