AugLift: Improving Generalization of 3D Pose Lifting via Depth Cues

Published: 15 Mar 2025 · Last Modified: 07 Apr 2025 · OpenReview Archive Direct Upload · CC BY 4.0
Abstract: A typical 3D human pose estimation approach is to "lift" a 2D pose by leveraging motion. A sequence of 2D poses helps resolve the depth ambiguities inherent in a single 2D pose by harnessing temporal information. The prevailing belief is that longer sequences capture more context and improve lifting performance, but our findings challenge this assumption. We find that generalization for models with longer sequences is inherently more challenging, as they overfit to dataset-specific motion patterns. Motivated by these findings, we explore reducing motion dependence by extracting richer information from a single frame. We introduce AugLift, which enhances lifting by incorporating per-keypoint depth and occlusion cues alongside the 2D keypoints. These cues are obtained from a monocular depth estimation model, allowing AugLift to leverage generalizable depth priors without requiring a depth sensor. In experiments across 4 datasets, single-frame AugLift improves cross-dataset performance by 21%, within-dataset performance by 8%, and reduces the generalization gap by 27%. Notably, AugLift exhibits improved and more stable performance across sequence lengths, achieving optimal performance with shorter sequence lengths and reducing reliance on extended motion context. These results indicate that AugLift builds a more robust representation, improving generalization across datasets. Code to be released later.
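The input augmentation the abstract describes — attaching per-keypoint depth and occlusion cues to the 2D keypoints before lifting — can be sketched roughly as below. This is a minimal illustration, not the released implementation: the function name `augment_keypoints`, the nearest-pixel depth sampling, and the placeholder occlusion flag are all assumptions; the depth map would come from some off-the-shelf monocular depth estimator.

```python
import numpy as np

def augment_keypoints(kp2d, depth_map, occlusion=None):
    """Hypothetical sketch of AugLift-style input augmentation:
    append a per-keypoint depth cue (and an occlusion cue) to each
    2D keypoint, yielding a (J, 4) array instead of (J, 2).

    kp2d:      (J, 2) keypoint pixel coordinates (x, y)
    depth_map: (H, W) depths predicted by a monocular depth model
    occlusion: optional (J, 1) per-keypoint occlusion scores
    """
    h, w = depth_map.shape
    # Sample the predicted depth at each keypoint's nearest pixel.
    xs = np.clip(np.round(kp2d[:, 0]).astype(int), 0, w - 1)
    ys = np.clip(np.round(kp2d[:, 1]).astype(int), 0, h - 1)
    depth = depth_map[ys, xs][:, None]
    if occlusion is None:
        # Placeholder: no occlusion estimate available.
        occlusion = np.zeros((kp2d.shape[0], 1))
    # The lifting network would then consume these enriched inputs.
    return np.concatenate([kp2d, depth, occlusion], axis=1)
```

A single-frame lifter would take this (J, 4) array per frame in place of the usual (J, 2) keypoints; the paper's claim is that the extra two channels supply generalizable depth priors without a depth sensor.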