Dual networks based 3D Multi-Person Pose Estimation from Monocular VideoDownload PDFOpen Website

20 Sept 2022 (modified: 20 Sept 2022)OpenReview Archive Direct UploadReaders: Everyone
Abstract: We propose the integration of top-down and bottom-up approaches to exploit their strengths. Our top-down network estimates human joints from all persons instead of one in an image patch, making it robust to possible erroneous bounding boxes. Our bottom-up network incorporates human-detection based normalized heatmaps, allowing the network to be more robust in handling scale variations. Finally, the estimated 3D poses from the top-down and bottom-up networks are fed into our integration network for final 3D poses. To address the common gaps between training and testing data, we do optimization during the test time, by refining the estimated 3D human poses using high-order temporal constraint, re-projection loss, and bone length regularizations. We also introduce a two-person pose discriminator that enforces natural two-person interactions. Finally, we apply a semi-supervised method to overcome the 3D ground-truth data scarcity. Our evaluations demonstrate the effectiveness of the proposed method and its individual components.
0 Replies

Loading