- TL;DR: We use mixture density networks to do full conditional density estimation for spatial offset regression and apply it to the human pose estimation task.
- Abstract: Offset regression is a standard method for spatial localization in many vision tasks, including human pose estimation, object detection, and instance segmentation. However, if high localization accuracy is crucial for a task, convolutional neural networks will offset regression usually struggle to deliver. This can be attributed to the locality of the convolution operation, exacerbated by variance in scale, clutter, and viewpoint. An even more fundamental issue is the multi-modality of real-world images. As a consequence, they cannot be approximated adequately using a single mode model. Instead, we propose to use mixture density networks (MDN) for offset regression, allowing the model to manage various modes efficiently and learning to predict full conditional density of the outputs given the input. On 2D human pose estimation in the wild, which requires accurate localisation of body keypoints, we show that this yields significant improvement in localization accuracy. In particular, our experiments reveal viewpoint variation as the dominant multi-modal factor. Further, by carefully initializing MDN parameters, we do not face any instabilities in training, which is known to be a big obstacle for widespread deployment of MDN. The method can be readily applied to any task with a spatial regression component. Our findings highlight the multi-modal nature of real-world vision, and the significance of explicitly accounting for viewpoint variation, at least when spatial localization is concerned.
- Keywords: Mixture Density Estimation, Spatial Offset Regression, Dense Prediction, Human Pose Estimation