Continuity-Driven Pose Estimation for Videos

27 Sept 2024 (modified: 13 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: video pose estimation
TL;DR: In this paper, we introduce a novel approach to supervise continuity in the video pose estimation model from two perspectives: semantic continuity and keypoint distribution continuity.
Abstract: Video-based pose estimation plays a critical role in understanding human actions and enabling effective human-computer interaction. By exploiting temporal information across video frames, it improves the localization of human keypoints. Previous feature-fusion methods often rely on a frozen single-frame backbone trained on individual frames, followed by a separate network that learns temporal information from video sequences. Consequently, these approaches fail to capture temporal continuity between frames at the backbone level, restricting the network's capacity to learn and leverage sequential information. In this paper, we introduce a novel approach that supervises continuity throughout the entire video pose estimation model from two perspectives: semantic continuity and pixel-wise keypoint distribution continuity. To this end, we propose a Semantic Alignment Space into which a semantic aligner encodes feature maps from different frames, enabling continuous supervision of the encoded representations. To further maintain pixel-wise keypoint distribution continuity, we introduce the Trajectory Probability Difference Integration method, which minimizes the expected trajectory difference across frames. Additionally, to better capture temporal dependencies, we present a Multi-frame Heatmap Fusion structure that aggregates heatmaps from adjacent frames into a more refined output. Extensive experiments on the PoseTrack17, PoseTrack18, and PoseTrack21 datasets demonstrate the effectiveness of our approach, which consistently achieves state-of-the-art results.
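As a rough illustration of how the three components named in the abstract could fit together, below is a minimal PyTorch sketch. All names (SemanticAligner, semantic_continuity_loss, keypoint_distribution_continuity_loss, fuse_adjacent_heatmaps) and the specific formulations (cosine similarity in the alignment space, a total-variation difference between consecutive keypoint distributions, a fixed 3-frame average for fusion) are illustrative assumptions, not the paper's actual method, which the abstract does not specify.

# Hypothetical sketch of the three continuity components described in the
# abstract; module names and loss formulations are assumptions for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticAligner(nn.Module):
    """Encodes per-frame feature maps into a shared alignment space
    (a stand-in for the paper's Semantic Alignment Space)."""

    def __init__(self, in_channels: int, embed_dim: int = 128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_channels, embed_dim, kernel_size=1),
            nn.AdaptiveAvgPool2d(1),  # one global descriptor per frame
            nn.Flatten(),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, C, H, W) backbone features for T consecutive frames
        return F.normalize(self.proj(feats), dim=-1)  # (T, embed_dim)


def semantic_continuity_loss(embeddings: torch.Tensor) -> torch.Tensor:
    # Penalize semantic drift between adjacent frames in the alignment space
    # (one plausible reading of "continuous supervision of the encoded
    # representations"): 1 - cosine similarity of consecutive embeddings.
    sim = F.cosine_similarity(embeddings[:-1], embeddings[1:], dim=-1)
    return (1.0 - sim).mean()


def keypoint_distribution_continuity_loss(heatmaps: torch.Tensor) -> torch.Tensor:
    # heatmaps: (T, K, H, W) per-frame keypoint heatmaps. As a proxy for
    # "Trajectory Probability Difference Integration", integrate (sum) the
    # per-pixel difference between consecutive keypoint distributions.
    T, K, H, W = heatmaps.shape
    probs = F.softmax(heatmaps.view(T, K, -1), dim=-1)  # per-joint distribution
    diff = (probs[1:] - probs[:-1]).abs().sum(dim=-1)   # total variation per joint
    return diff.mean()


def fuse_adjacent_heatmaps(heatmaps: torch.Tensor) -> torch.Tensor:
    # Minimal multi-frame heatmap fusion: a fixed average over a 3-frame
    # window with edge replication (the paper's fusion is presumably learned).
    # heatmaps: (T, K, H, W) -> fused heatmaps of the same shape.
    padded = torch.cat([heatmaps[:1], heatmaps, heatmaps[-1:]], dim=0)
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0


if __name__ == "__main__":
    T, C, K, H, W = 5, 256, 17, 64, 48
    feats = torch.randn(T, C, H, W)
    heatmaps = torch.randn(T, K, H, W)
    aligner = SemanticAligner(C)
    loss = semantic_continuity_loss(aligner(feats)) \
         + keypoint_distribution_continuity_loss(heatmaps)
    fused = fuse_adjacent_heatmaps(heatmaps)
    print(loss.item(), fused.shape)

In an actual training loop, such continuity losses would presumably be added to the standard heatmap regression loss with weighting coefficients, so that the backbone itself, not only a downstream fusion network, receives the temporal supervision the abstract argues for.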
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11765