Abstract: Human pose estimation has advanced greatly in recent years. However, even the best-performing models are not shift equivariant. In particular, a small change in the input image often results in a drastic change in the output, which is especially problematic in video applications. The prevalence of top-down approaches, which typically rely on a (non-equivariant) object detector in the first stage, exacerbates this issue. In this paper, we first demonstrate that the biased keypoint representation and the non-equivariant network components are the two main obstacles to shift-equivariant pose estimation. To address these limitations, we propose an unbiased decoding method and redesign the necessary network components (e.g., APS-ResBlock, SSP). Extensive experiments show that our method not only produces much more stable results under shifted input, but also achieves better metrics, since it can tolerate inaccurate detector output from the first stage. To our knowledge, this is the first work to address the problem of shift equivariance in the field of pose estimation. Our method can be easily applied to existing CNN-based pose estimation networks.
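As a rough illustration of why ordinary network components break shift equivariance, the toy sketch below compares plain stride-2 subsampling with a polyphase selection rule on a 1-D signal. It assumes that "APS" in APS-ResBlock refers to adaptive polyphase sampling, which the abstract does not spell out. The function names and the max-norm selection rule are illustrative choices, not the authors' implementation.

```python
import numpy as np

def strided_downsample(x, stride=2):
    """Plain strided subsampling: always keeps indices 0, stride, 2*stride, ..."""
    return x[::stride]

def aps_downsample(x, stride=2):
    """Keep the polyphase component with the largest L2 norm (assumed APS-style rule)."""
    components = [x[offset::stride] for offset in range(stride)]
    norms = [np.linalg.norm(c) for c in components]
    return components[int(np.argmax(norms))]

def is_circular_shift(a, b):
    """True if b equals a rolled by some integer number of positions."""
    return any(np.array_equal(np.roll(a, k), b) for k in range(len(a)))

x = np.array([0.0, 1.0, 0.0, 3.0, 0.0, 5.0, 0.0, 7.0])
x_shifted = np.roll(x, 1)  # one-sample circular shift of the input

# Does shifting the input only shift the output, or change it entirely?
plain_consistent = is_circular_shift(strided_downsample(x), strided_downsample(x_shifted))
aps_consistent = is_circular_shift(aps_downsample(x), aps_downsample(x_shifted))
print(plain_consistent, aps_consistent)
```

On this toy signal, a one-sample shift changes the plain strided output completely, while the polyphase selection yields the same values up to a circular shift.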
External IDs:dblp:conf/wacv/WangLW025