STRIDE: Single-Video Based Temporally Continuous Occlusion-Robust 3D Pose Estimation

Published: 01 Jan 2025 · Last Modified: 09 May 2025 · WACV 2025 · CC BY-SA 4.0
Abstract: Accurately estimating 3D human poses is crucial for fields like action recognition, gait recognition, and virtual/augmented reality. However, predicting human poses under severe occlusion remains a persistent and significant challenge. Existing image-based estimators struggle with heavy occlusions due to a lack of temporal context, resulting in inconsistent predictions, while video-based models, despite benefiting from temporal data, face limitations with prolonged occlusions spanning multiple frames. Additionally, existing algorithms often struggle to generalize to unseen videos. Addressing these challenges, we propose STRIDE (Single-video based TempoRally contInuous Occlusion-Robust 3D Pose Estimation), a novel Test-Time Training (TTT) approach that fits a human motion prior to each video for estimating 3D human poses. Our proposed approach handles occlusions not encountered during the model's training by refining a sequence of noisy initial pose estimates into accurate, temporally coherent poses at test time, effectively overcoming the limitations of existing methods. Our flexible, model-agnostic framework allows us to use any off-the-shelf 3D pose estimation method to improve robustness and temporal consistency. We validate STRIDE's efficacy through comprehensive experiments on multiple challenging datasets, where it not only outperforms existing single-image and video-based pose estimation models but also showcases superior handling of substantial occlusions, achieving fast, robust, accurate, and temporally consistent 3D pose estimates. Code is made publicly available at https://github.com/take2rohit/stride
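To make the test-time refinement idea concrete, below is a minimal, hypothetical sketch of refining a noisy per-frame 3D pose sequence into a temporally coherent one. It uses a simple joint-acceleration smoothness penalty as a crude stand-in for STRIDE's learned human motion prior; the function name, per-joint `confidences` weighting, and loss weights are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: test-time refinement of noisy 3D pose estimates.
# A temporal-smoothness term stands in for a learned motion prior.
import torch

def refine_poses(noisy_poses: torch.Tensor,   # (T, J, 3) initial estimates
                 confidences: torch.Tensor,   # (T, J) per-joint weights
                 steps: int = 200,
                 lr: float = 1e-2,
                 w_data: float = 1.0,
                 w_smooth: float = 10.0) -> torch.Tensor:
    """Optimize the pose sequence at test time.

    Low confidence (e.g., under heavy occlusion) down-weights the data
    term, so those frames rely more on the temporal prior.
    """
    poses = noisy_poses.clone().requires_grad_(True)
    optim = torch.optim.Adam([poses], lr=lr)
    for _ in range(steps):
        optim.zero_grad()
        # Data term: stay close to the initial estimates where confident.
        data = (confidences[..., None] * (poses - noisy_poses) ** 2).mean()
        # Smoothness term: penalize joint acceleration across frames
        # (a simplified substitute for a learned human motion prior).
        accel = poses[2:] - 2 * poses[1:-1] + poses[:-2]
        smooth = (accel ** 2).mean()
        loss = w_data * data + w_smooth * smooth
        loss.backward()
        optim.step()
    return poses.detach()
```

Because the optimization only consumes a sequence of initial 3D estimates, any off-the-shelf per-frame pose estimator can supply `noisy_poses`, mirroring the model-agnostic design described in the abstract.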