Abstract. We present EgoPoseFormer, a simple yet effective transformer-based model for stereo egocentric human pose estimation. The main challenge in egocentric pose estimation is overcoming joint invisibility, which is caused by self-occlusion or the limited field of view (FOV) of head-mounted cameras. Our approach overcomes this challenge by incorporating a two-stage pose estimation paradigm: in the first stage, our model leverages
the global information to estimate each joint's coarse location; in the second stage, it employs a DETR-style transformer to refine the coarse locations by exploiting fine-grained stereo visual features. In addition, we present a Deformable Stereo Attention operation that enables our transformer to effectively process multi-view features and thus accurately localize each joint in the 3D world. We evaluate our method on
the stereo UnrealEgo dataset and show that it significantly outperforms previous approaches while being computationally efficient: it improves MPJPE by 27.4mm (a 45% improvement) over the state of the art while using only 7.9% of its model parameters and 13.1% of its FLOPs. Surprisingly, with proper
training settings, we find that even our first-stage pose proposal network
can outperform previous methods. We also show that our method extends seamlessly to the monocular setting, where it achieves state-of-the-art performance on the SceneEgo dataset, improving MPJPE by 25.5mm (a 21% improvement) over the best existing method while using only 60.7% of its model parameters and 36.4% of its FLOPs. Code is
available at https://github.com/ChenhongyiYang/egoposeformer.
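
To make the two-stage design concrete, the following is a minimal PyTorch-style sketch of the paradigm described above. It is an illustrative approximation under stated assumptions, not the released implementation: the module names (PoseProposalNet, RefinementDecoder), the feature dimensions, and the use of standard cross-attention in place of the paper's Deformable Stereo Attention are all placeholders introduced here.

```python
# Illustrative sketch of the two-stage paradigm (assumed names and dimensions).
import torch
import torch.nn as nn

class PoseProposalNet(nn.Module):
    """Stage 1: regress coarse 3D joint locations from pooled global stereo features."""
    def __init__(self, feat_dim=256, num_joints=16):
        super().__init__()
        self.num_joints = num_joints
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_joints * 3),
        )

    def forward(self, left_feat, right_feat):
        # left_feat, right_feat: (B, C, H, W) stereo feature maps
        g = torch.cat([left_feat.mean(dim=(2, 3)), right_feat.mean(dim=(2, 3))], dim=-1)
        return self.head(g).view(-1, self.num_joints, 3)  # coarse joints: (B, J, 3)

class RefinementDecoder(nn.Module):
    """Stage 2: DETR-style decoder that refines coarse joints using stereo features.
    Plain multi-head cross-attention stands in here for the paper's Deformable
    Stereo Attention, which instead samples features around each joint's
    projection in both views."""
    def __init__(self, feat_dim=256, num_layers=3):
        super().__init__()
        self.joint_embed = nn.Linear(3, feat_dim)
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.offset_head = nn.Linear(feat_dim, 3)

    def forward(self, coarse_joints, left_feat, right_feat):
        # Flatten both views into one token sequence: (B, 2*H*W, C)
        mem = torch.cat([left_feat.flatten(2), right_feat.flatten(2)], dim=2).transpose(1, 2)
        q = self.joint_embed(coarse_joints)         # one query per joint: (B, J, C)
        q = self.decoder(q, mem)                    # attend to fine-grained stereo features
        return coarse_joints + self.offset_head(q)  # refined joints: (B, J, 3)

# Usage with dummy stereo features
left = torch.randn(2, 256, 16, 16)
right = torch.randn(2, 256, 16, 16)
stage1, stage2 = PoseProposalNet(), RefinementDecoder()
coarse = stage1(left, right)
refined = stage2(coarse, left, right)
print(coarse.shape, refined.shape)  # torch.Size([2, 16, 3]) for both
```

The sketch also reflects why the first-stage proposals matter: the decoder predicts residual offsets on top of them, so even a lightweight proposal network provides a strong starting point for refinement.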