Abstract: We present EgoHumans, a new multi-view multi-human video benchmark to advance the state of the art in egocentric human 3D pose estimation and tracking. Existing egocentric benchmarks capture either a single subject or indoor-only scenarios, which limits the generalization of computer vision algorithms to real-world applications. We propose
a novel 3D capture setup to construct a comprehensive egocentric multi-human benchmark in the wild with annotations
to support diverse tasks such as human detection, tracking,
2D/3D pose estimation, and mesh recovery. We leverage consumer-grade, camera-equipped glasses for the egocentric view, which enables us to capture dynamic activities such as soccer, fencing, and volleyball. Furthermore,
our multi-view setup generates accurate 3D ground truth
even under severe or complete occlusion. The dataset consists of more than 125k egocentric images, spanning diverse
scenes with a particular focus on challenging and unchoreographed multi-human activities and fast-moving egocentric
views. We rigorously evaluate existing state-of-the-art methods and highlight their limitations in the egocentric scenario, specifically in multi-human tracking. To address these limitations, we propose EgoFormer, a novel approach with a
multi-stream transformer architecture and explicit 3D spatial
reasoning to estimate and track the human pose. EgoFormer
significantly outperforms prior art by 13.6% IDF1 and 9.3
HOTA on the EgoHumans dataset.