ReCAP: Recursive Prompting for Self-Supervised Category-Level Articulated Pose Estimation from an Image
Keywords: articulated object pose estimation
Abstract: Estimating category-level articulated object poses is crucial for robotics and virtual reality.
Prior works either rely on costly annotations, limiting scalability, or depend on auxiliary signals such as dense RGB-D sensing and geometric constraints that are rarely available in practice.
As a result, articulated pose estimation from a single RGB image remains largely unsolved.
We propose ReCAP, a Recursive prompting framework for self-supervised Category-level Articulated object Pose estimation from a single image.
ReCAP adapts a pre-trained foundation model using a Recursive Prompt Generator with residual injection, introducing less than 1% additional parameters.
This mechanism enables parameter-efficient scaling through recursive refinement, while residual injection preserves token alignment under dynamic reconfiguration, yielding robust adaptation to articulated objects (sketched after the abstract).
To further resolve structural ambiguities, we introduce $\mathcal{X}$-SGP, a multi-scale fusion module that adaptively integrates semantic and geometric cues, an aspect often overlooked by geometry-centric approaches (also sketched below).
Experiments on synthetic and real benchmarks demonstrate state-of-the-art monocular articulated pose estimation without requiring 3D supervision or auxiliary depth input.
To the best of our knowledge, ReCAP is the first self-supervised framework to accomplish this task from a single image.
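Since the abstract only names the mechanism, here is a minimal, hypothetical PyTorch sketch of a recursive prompt generator with residual injection. The class name, dimensions, recursion depth, and refiner design are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): a recursive prompt generator
# with residual injection for a frozen ViT-style backbone. Names, sizes,
# and the recursion depth are assumptions chosen for illustration.
import torch
import torch.nn as nn

class RecursivePromptGenerator(nn.Module):
    def __init__(self, dim=768, num_prompts=8, num_steps=3):
        super().__init__()
        self.base_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        # A single lightweight refiner is reused at every recursion step,
        # so the parameter count stays small regardless of depth.
        self.refiner = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, dim),
        )
        self.num_steps = num_steps

    def forward(self, batch_size):
        prompts = self.base_prompts.unsqueeze(0).expand(batch_size, -1, -1)
        refined = prompts
        for _ in range(self.num_steps):
            # Residual injection: each step adds the original prompts back,
            # keeping refined tokens aligned with the frozen backbone's
            # token space as the recursion reconfigures them.
            refined = self.refiner(refined) + prompts
        return refined
```

In use, the refined prompts would presumably be prepended to the patch tokens of the frozen encoder, so only the generator's small refiner and base prompts are trained.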
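Likewise, a hedged sketch of what an adaptive semantic-geometric fusion step could look like. The per-scale projections, sigmoid gating, and averaging across scales are assumptions in the spirit of $\mathcal{X}$-SGP, not its actual design.

```python
# Hypothetical sketch of multi-scale semantic/geometric fusion; the gating
# scheme and feature shapes are assumptions, not the paper's X-SGP module.
import torch
import torch.nn as nn

class SemanticGeometricFusion(nn.Module):
    def __init__(self, dim=256, num_scales=3):
        super().__init__()
        self.proj_sem = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_scales)])
        self.proj_geo = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_scales)])
        # Per-scale gate predicts how much to trust semantic vs. geometric cues.
        self.gate = nn.ModuleList([
            nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                          nn.Linear(dim, 1), nn.Sigmoid())
            for _ in range(num_scales)
        ])

    def forward(self, sem_feats, geo_feats):
        # sem_feats, geo_feats: lists of (B, N, dim) token maps, one per scale,
        # assumed to share the same token count N after resampling.
        fused = []
        for i, (s, g) in enumerate(zip(sem_feats, geo_feats)):
            s, g = self.proj_sem[i](s), self.proj_geo[i](g)
            alpha = self.gate[i](torch.cat([s, g], dim=-1))  # (B, N, 1)
            fused.append(alpha * s + (1 - alpha) * g)
        # Merge scales by simple averaging; the paper may use a learned merge.
        return torch.stack(fused, dim=0).mean(dim=0)
```

The gate lets the network lean on semantic features where geometry is ambiguous (e.g., symmetric parts) and on geometric features where appearance is uninformative.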
Primary Area: applications to robotics, autonomy, planning
Submission Number: 6189