Enhancing Human Pose Estimation in the Internet of Things via Diffusion Generative Models

Published: 30 Jan 2025 · Last Modified: 05 Mar 2025 · IEEE Internet of Things Journal · CC BY 4.0
Abstract: With the ongoing development of public video surveillance technology, accurate human pose estimation is becoming increasingly important in urban administration and law enforcement. However, existing methods rely on large-scale dense annotations, which are labor-intensive and time-consuming to obtain. To address this, we propose SparsePose, which leverages training videos with sparse annotations (labeled every k frames) and learns to propagate poses temporally so as to estimate poses in the unlabeled frames. Technically, we design a novel dual-branch architecture that combines 1) pose forecasting from consecutive neighboring frames with 2) visual cues from the current frame and the nearest labeled frames. We theoretically derive intra-branch and inter-branch mutual information losses, which ensure that maximally pose-relevant features are extracted from the current frame and that the two branches complement each other, yielding precise pose estimation. Additionally, we propose a diffusion-based generative enhancement, which improves the model's robustness to challenging scenes from the perspective of diversity. Empirical results show that our method significantly outperforms state-of-the-art methods for sparsely labeled pose estimation on three benchmark datasets.
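
To make the dual-branch idea concrete, the sketch below shows one possible way to wire a temporal pose-forecasting branch and a visual branch together and to couple them with an InfoNCE-style lower bound on inter-branch mutual information. This is only a minimal illustration under assumed design choices: the module names (`DualBranchSketch`, `infonce_mi_lower_bound`), the GRU/CNN layer choices, and the specific contrastive form of the loss are hypothetical and are not the authors' implementation or their derived objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualBranchSketch(nn.Module):
    """Illustrative sketch (not the paper's code): one branch forecasts poses
    from the nearest labeled frames, the other extracts visual features from
    the current unlabeled frame; their fusion predicts that frame's pose."""

    def __init__(self, feat_dim: int = 256, num_joints: int = 17):
        super().__init__()
        # Branch 1: temporal pose forecasting over 2D joints of labeled frames
        # (hypothetical GRU-based propagator).
        self.forecaster = nn.GRU(input_size=num_joints * 2,
                                 hidden_size=feat_dim, batch_first=True)
        # Branch 2: visual encoder of the current frame (hypothetical CNN stub).
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Fusion head regressing joint coordinates for the unlabeled frame.
        self.head = nn.Linear(2 * feat_dim, num_joints * 2)

    def forward(self, labeled_pose_seq: torch.Tensor, current_frame: torch.Tensor):
        # labeled_pose_seq: (B, T, num_joints*2) poses of nearest labeled frames
        # current_frame:    (B, 3, H, W) unlabeled frame to be estimated
        _, h = self.forecaster(labeled_pose_seq)           # h: (1, B, feat_dim)
        temporal_feat = h.squeeze(0)                       # (B, feat_dim)
        visual_feat = self.visual_encoder(current_frame)   # (B, feat_dim, 1, 1)
        visual_feat = visual_feat.flatten(1)               # (B, feat_dim)
        fused = torch.cat([temporal_feat, visual_feat], dim=1)
        return self.head(fused), temporal_feat, visual_feat


def infonce_mi_lower_bound(z_a: torch.Tensor, z_b: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style lower bound on mutual information between the two
    branches' features; a stand-in for an inter-branch MI objective, not the
    exact loss derived in the paper."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature                   # (B, B) similarities
    labels = torch.arange(z_a.size(0), device=z_a.device)  # matched pairs
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    model = DualBranchSketch()
    poses = torch.randn(4, 5, 17 * 2)        # 5 labeled frames per clip
    frame = torch.randn(4, 3, 128, 128)      # current unlabeled frame
    pred, t_feat, v_feat = model(poses, frame)
    loss_mi = infonce_mi_lower_bound(t_feat, v_feat)
    print(pred.shape, loss_mi.item())
```

In this reading, maximizing the inter-branch bound encourages the forecasting and visual features to agree on pose-relevant content, while the fusion head consumes both; an analogous intra-branch term and the diffusion-based enhancement described in the abstract are omitted here for brevity.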