Keywords: radar perception, 3D pose estimation, transformer, deformable attention
TL;DR: Estimate 3D human poses from multi-view radar data using 2D image-plane keypoints and 3D BBox labels, rather than more expensive 3D keypoint labels.
Abstract: Radar-based indoor 3D human pose estimation has typically relied on fine-grained 3D keypoint labels, which are costly to obtain, especially in complex indoor settings involving clutter, occlusions, or multiple people. In this paper, we propose \textbf{RAPTR} (RAdar Pose esTimation using tRansformer) under weak supervision, using only 3D BBox and 2D keypoint labels, which are considerably easier and more scalable to collect.
RAPTR is characterized by a two-stage pose decoder architecture with pseudo-3D deformable attention that enhances pose/joint queries with multi-view radar features: a pose decoder estimates initial 3D poses with a 3D template loss designed to utilize the 3D BBox labels and mitigate depth ambiguity, and a joint decoder refines the initial poses with the 2D keypoint labels and a 3D gravity loss.
Evaluated on two indoor radar datasets, RAPTR outperforms existing methods, reducing joint position error by $34.3$\% on HIBER and $76.9$\% on MMVR. Our implementation is available at \url{https://github.com/merlresearch/radar-pose-transformer}.
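Below is a minimal, illustrative sketch (not the authors' released code) of the two-stage decoding idea described in the abstract: a pose decoder produces initial 3D poses for per-person queries from multi-view radar features, and a joint decoder refines per-joint estimates. The pseudo-3D deformable attention is approximated here by standard transformer cross-attention, and all module names, dimensions, and the joint count are assumptions.

```python
# Sketch of the two-stage pose/joint decoding described in the abstract.
# Standard cross-attention stands in for pseudo-3D deformable attention;
# hyperparameters (d_model, n_persons, n_joints, layer counts) are assumed.
import torch
import torch.nn as nn


class TwoStagePoseDecoder(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_persons=4, n_joints=15):
        super().__init__()
        self.n_persons, self.n_joints = n_persons, n_joints
        # Stage 1: one learnable query per person ("pose query").
        self.pose_queries = nn.Embedding(n_persons, d_model)
        self.pose_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            num_layers=3,
        )
        self.pose_head = nn.Linear(d_model, n_joints * 3)  # initial 3D pose
        # Stage 2: one learnable query per (person, joint) pair ("joint query").
        self.joint_queries = nn.Embedding(n_persons * n_joints, d_model)
        self.joint_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            num_layers=3,
        )
        self.refine_head = nn.Linear(d_model, 3)  # per-joint 3D refinement offset

    def forward(self, radar_feats):
        # radar_feats: (B, N_tokens, d_model) flattened multi-view radar features.
        B = radar_feats.size(0)
        pose_q = self.pose_queries.weight.unsqueeze(0).expand(B, -1, -1)
        pose_feat = self.pose_decoder(pose_q, radar_feats)
        init_pose = self.pose_head(pose_feat).view(B, self.n_persons, self.n_joints, 3)

        joint_q = self.joint_queries.weight.unsqueeze(0).expand(B, -1, -1)
        joint_feat = self.joint_decoder(joint_q, radar_feats)
        offset = self.refine_head(joint_feat).view(B, self.n_persons, self.n_joints, 3)
        return init_pose, init_pose + offset  # initial and refined 3D poses


# Usage with a fake batch of flattened multi-view radar features.
feats = torch.randn(2, 1024, 256)
init, refined = TwoStagePoseDecoder()(feats)
print(init.shape, refined.shape)  # torch.Size([2, 4, 15, 3]) twice
```

In the paper, the initial poses would be supervised with the 3D template loss built from 3D BBox labels, and the refined poses with the 2D keypoint loss (after projection to the image plane) plus the 3D gravity loss; those losses are omitted from this sketch.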
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 19792