Abstract: Human pose estimation in videos has long been a compelling yet challenging task in computer vision. The task remains difficult because of complex video scenes, which exhibit problems such as video defocus and self-occlusion. Recent methods strive to integrate multi-frame visual features generated by a backbone network for pose estimation. However, they often ignore the useful joint information encoded in the initial heatmap, which is a by-product of the backbone network. In contrast, methods that attempt to refine the initial heatmap fail to consider any spatio-temporal motion features. As a result, existing pose estimation methods fall short because they cannot leverage both local joint (heatmap) information and global motion (feature) dynamics.
To address this problem, we propose a novel joint-motion mutual learning framework for pose estimation that captures both local joint dependency and global pixel-level motion dynamics. Specifically, we introduce a context-aware joint learner that adaptively leverages initial heatmaps and motion flows to retrieve robust local joint features. Because local joint features and global motion flows are complementary, we further propose a progressive joint-motion mutual learning scheme in which joint features and motion flows exchange information and learn interactively, improving the capability of the model. More importantly, to capture more diverse joint and motion cues, we theoretically analyze and propose an information orthogonality objective that avoids learning redundant information from the multiple cues. Experiments show that our method outperforms prior art on three challenging benchmarks.
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: This work uses multi-modal data (heatmaps and dense optical flow) to tackle the challenging task of human pose estimation in videos, contributing to multimedia processing. By introducing a joint-motion mutual learning framework, it integrates local joint features and global motion flow, improving the capabilities of the pose estimation model. The proposed approach is consistent with the goals of ACM MM in that it facilitates the understanding and analysis of multimedia content, especially in the field of human activity recognition and understanding.
Supplementary Material: zip
Submission Number: 2904