Keywords: Human Pose Estimation, Language Models
TL;DR: Our work proposes a model that generates fine-grained pose descriptions through language supervision, outperforming existing 3D pose estimator and VLM approaches that produce generic and inaccurate responses.
Abstract: Despite progress in 3D human pose estimation, its reliance on expensive multi-view setups and limited dataset availability hinders scalability in real-world applications. We propose a novel framework that distills 3D spatial understanding into a language-aligned representation space using only 2D and 3D pose skeletons rendered as images. Our method learns a shared embedding space where 2D and 3D pose images are projected close to their corresponding natural language descriptions. During training, the model leverages 3D pose supervision to enrich semantic alignment, while at test time it operates exclusively on 2D poses inferred from real images. This enables high-quality language-based reasoning, such as action description and question answering, without the additional computational cost or supervision requirements of 3D pose estimators at inference. Our approach not only reduces reliance on 3D sensors but also demonstrates that 2D pose alone, when trained with 3D-informed language grounding, can achieve rich semantic understanding. Experiments on a newly curated dataset of 80K annotated pose images confirm the effectiveness of our method, showing 20.8% and 44.1% improvements over 2D-only baselines and 1.8% and 1.3% improvements over 3D methods in VQA accuracy and BLEU-4 scores, respectively.
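The abstract describes aligning rendered 2D and 3D pose images with their text descriptions in a shared embedding space. The following is a minimal sketch, not the authors' released code, of one common way to implement such alignment: projection heads over pose-image and text features trained with a symmetric contrastive (CLIP-style) loss. All module names, feature dimensions, and the specific loss are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseTextAligner(nn.Module):
    """Projects pose-image and text features into one shared space (assumed design)."""
    def __init__(self, img_dim=512, txt_dim=512, embed_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07), CLIP-style temperature

    def forward(self, img_feats, txt_feats):
        img = F.normalize(self.img_proj(img_feats), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        logits = self.logit_scale.exp() * img @ txt.t()
        labels = torch.arange(img.size(0), device=img.device)
        # Symmetric InfoNCE: image-to-text and text-to-image
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))

# Training step (assumed): features of rendered 2D *and* 3D pose images are both
# pulled toward the same caption, so 3D supervision shapes the shared space.
model = PoseTextAligner()
feats_2d = torch.randn(8, 512)   # stand-in for a 2D pose-image encoder output
feats_3d = torch.randn(8, 512)   # stand-in for a 3D pose-image encoder output
feats_txt = torch.randn(8, 512)  # stand-in for a text encoder output
loss = model(feats_2d, feats_txt) + model(feats_3d, feats_txt)
loss.backward()

# Inference (assumed): only 2D pose features are needed; the aligned embedding
# then supports language-based reasoning such as description and QA.
```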
Submission Number: 1