Abstract: With the ongoing development of public video surveillance technology, accurate human pose estimation is becoming increasingly important in urban administration and law enforcement. However, existing methods rely on large-scale dense annotations, which are labor-intensive and time-consuming to obtain. To address this, we propose SparsePose, which leverages training videos with sparse annotations (labeled every k frames) and learns to propagate poses temporally in order to estimate poses in the unlabeled frames. Technically, we introduce a novel dual-branch architecture that combines 1) pose forecasting from consecutive neighboring frames with 2) visual cues from the current frame and the nearest labeled frames. We theoretically derive intra-branch and inter-branch mutual information losses, which encourage the extraction of maximally pose-relevant features from the current frame and drive the two branches to complement each other toward precise pose estimation. Additionally, we propose a diffusion-based generative enhancement that improves the model's robustness to challenging scenes by increasing sample diversity. Empirical results show that our method significantly outperforms state-of-the-art methods for sparsely labeled pose estimation on three benchmark datasets.
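To make the dual-branch design concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: the module names (ForecastBranch, VisualBranch), feature dimensions, and the use of an InfoNCE lower bound as a tractable surrogate for the derived inter-branch mutual information loss are all our assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ForecastBranch(nn.Module):
    """Branch 1 (assumed form): forecast the current pose from the
    poses of consecutive neighboring frames."""

    def __init__(self, num_joints=17, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(input_size=num_joints * 2, hidden_size=hidden,
                          batch_first=True)
        self.head = nn.Linear(hidden, num_joints * 2)

    def forward(self, past_poses):
        # past_poses: (B, T, J*2) 2D poses of the neighboring frames
        feat, _ = self.rnn(past_poses)
        feat = feat[:, -1]                 # temporal feature at the last step
        return self.head(feat), feat       # forecast pose + branch feature


class VisualBranch(nn.Module):
    """Branch 2 (assumed form): visual cues from the current frame
    stacked with the nearest labeled frame."""

    def __init__(self, num_joints=17, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(     # stand-in for a real CNN backbone
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, num_joints * 2)

    def forward(self, frames):
        # frames: (B, 6, H, W) current frame + nearest labeled frame, stacked
        feat = self.backbone(frames)
        return self.head(feat), feat


def infonce_mi_lower_bound(z1, z2, temperature=0.1):
    """InfoNCE-style lower bound on the mutual information between the two
    branch features; a common surrogate, used here in place of the paper's
    derived intra-/inter-branch MI losses, which may differ."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature     # (B, B) cross-branch similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)
```

A training step would then, roughly, combine a supervised pose loss on the labeled frames with the MI terms, e.g. `loss = pose_loss + lambda_mi * infonce_mi_lower_bound(z_forecast, z_visual)`, where `lambda_mi` is a hypothetical weighting hyperparameter.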