Multi-view 3D Smooth Human Pose Estimation based on Heatmap Filtering and Spatio-temporal Information
Abstract: Estimating 3D human poses from time-synchronized, calibrated multi-view video usually consists of two steps: (1) a 2D detector locates the 2D coordinates of each joint in every frame via heatmaps, and (2) a post-processing method, such as the recursive pictorial structure model or robust triangulation, recovers the 3D coordinates. However, most existing methods operate on a single frame at a time. They do not exploit the temporal structure of the video sequence, they rely on post-processing algorithms, they are susceptible to human self-occlusion, and the sequences they generate suffer from jitter. We therefore propose a network model that incorporates both spatial and temporal features. Following a coarse-to-fine approach, the proposed heatmap temporal network (HTN) generates temporal heatmap information, and an occlusion heatmap filter removes low-quality heatmaps before they are sent to the HTN. The heatmap fusion and triangulation weights are adjusted dynamically, and intermediate supervision is employed to better integrate temporal and spatial information. The network is end-to-end differentiable. This addresses the long-standing problem of skeleton jitter and yields smooth, stable pose sequences.
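To make the triangulation step concrete, the following is a minimal sketch of weighted linear (DLT) triangulation of a single joint from several calibrated views. It is a generic textbook formulation, not the network described in the abstract: the function names `triangulate_weighted`, `project`, and `camera`, the toy intrinsics `K`, and the hand-set per-view weights are all assumptions made for illustration. In the proposed method the weights are adjusted dynamically by the network; here they are fixed by hand to show how down-weighting an unreliable (for example, occluded) view affects the result.

```python
import numpy as np

def triangulate_weighted(proj_mats, points_2d, weights):
    """Weighted DLT triangulation of one joint (illustrative sketch).

    proj_mats: (C, 3, 4) camera projection matrices
    points_2d: (C, 2) joint location (u, v) in each view
    weights:   (C,) per-view confidence; 0 removes a view's influence
    """
    proj_mats = np.asarray(proj_mats, dtype=float)
    points_2d = np.asarray(points_2d, dtype=float)
    weights = np.asarray(weights, dtype=float)

    rows = []
    for P, (u, v), w in zip(proj_mats, points_2d, weights):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(w * (u * P[2] - P[0]))
        rows.append(w * (v * P[2] - P[1]))
    A = np.stack(rows)                      # shape (2C, 4)

    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]                     # back to Euclidean coordinates

def project(P, X):
    """Project a 3D point X with projection matrix P to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Hypothetical toy setup: shared intrinsics, identity rotation, shifted cameras.
K = np.array([[1000.0, 0.0, 500.0],
              [0.0, 1000.0, 500.0],
              [0.0, 0.0, 1.0]])

def camera(t):
    return K @ np.hstack([np.eye(3), np.reshape(np.asarray(t, dtype=float), (3, 1))])

cams = [camera([0, 0, 0]), camera([-1, 0, 0]), camera([0, -1, 0])]
X_true = np.array([0.2, -0.1, 4.0])
pts = np.array([project(P, X_true) for P in cams])

# All views clean, equal weights.
X_est = triangulate_weighted(cams, pts, [1.0, 1.0, 1.0])

# Corrupt the third view (as an occlusion might) and compare weightings.
pts_bad = pts.copy()
pts_bad[2] += np.array([80.0, -60.0])
X_equal = triangulate_weighted(cams, pts_bad, [1.0, 1.0, 1.0])
X_down = triangulate_weighted(cams, pts_bad, [1.0, 1.0, 0.0])
```

With the corrupted detection, equal weights pull the estimate away from the true joint, whereas setting the third view's weight to zero recovers it from the two remaining views. A learned, differentiable weighting plays the same role without a hard on/off decision.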