ESMformer: Error-aware self-supervised transformer for multi-view 3D human pose estimation

Published: 01 Jan 2025 · Last Modified: 13 Nov 2024 · Pattern Recognition, 2025 · CC BY-SA 4.0
Abstract
Highlights:
• We develop a novel transformer-based multi-view 3D HPE framework (ESMformer). It hierarchically integrates single-view multi-level pose feature extraction with progressive feature fusion across viewpoints and levels. By mining meaningful multi-level information from multiple viewpoints, ESMformer effectively enriches pose feature expression.
• We design a simple yet effective relative attention mechanism to model the spatial dependencies among all human joints, significantly enhancing the pose feature representations and making them more robust and adaptable to changes in viewpoint and environmental conditions.
• We explore an error-aware self-supervised learning strategy that reduces the model's reliance on 3D pose annotations and mitigates the impact of incorrect 2D poses. The strategy uses the prediction errors of 3D poses to guide the selection of reliable 2D poses (see the sketch after this list).
• ESMformer achieves state-of-the-art results on three standard 3D HPE benchmarks. It significantly alleviates depth ambiguity and improves 3D HPE performance while keeping computational cost modest, without requiring any 3D pose annotations.
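The abstract does not specify how 3D prediction errors are measured in the absence of ground-truth 3D poses; a common proxy in multi-view self-supervision is the reprojection residual against the detected 2D poses. The sketch below illustrates one plausible form of such error-aware sample selection in PyTorch. The function names, tensor shapes, camera-matrix format, and keep_ratio are assumptions for illustration, not the paper's implementation.

```python
import torch

def reprojection_error(pred_3d, pose_2d, cams):
    """Per-sample mean reprojection residual across views.

    Assumed shapes (not from the paper):
      pred_3d : (B, J, 3)    predicted 3D joints
      pose_2d : (B, V, J, 2) detected 2D joints for V views
      cams    : (B, V, 3, 4) per-view projection matrices
    """
    ones = torch.ones_like(pred_3d[..., :1])
    homo = torch.cat([pred_3d, ones], dim=-1)                 # (B, J, 4)
    proj = torch.einsum('bvij,bkj->bvki', cams, homo)         # (B, V, J, 3)
    proj_2d = proj[..., :2] / proj[..., 2:].clamp(min=1e-6)   # perspective divide
    return (proj_2d - pose_2d).norm(dim=-1).mean(dim=(1, 2))  # (B,)

def select_reliable(error, keep_ratio=0.7):
    """Keep the fraction of samples with the smallest error;
    only these would contribute to the self-supervised loss."""
    k = max(1, int(keep_ratio * error.numel()))
    return error.topk(k, largest=False).indices
```

Under this reading, only the selected indices would enter the self-supervised loss, which matches the abstract's description of using 3D prediction errors to filter out unreliable 2D poses.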