ReCAP: Recursive Prompting for Self-Supervised Category-Level Articulated Pose Estimation from an Image

ICLR 2026 Conference Submission 6189 Authors

15 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: articulated object pose estimation
Abstract: Estimating category-level articulated object poses is crucial for robotics and virtual reality. Prior works either rely on costly annotations, which limits scalability, or depend on auxiliary signals such as dense RGB-D sensing and geometric constraints that are rarely available in practice. As a result, articulated pose estimation from a single RGB image remains largely unsolved. We propose ReCAP, a Recursive-prompting framework for self-supervised Category-level Articulated object Pose estimation from a single image. ReCAP adapts a pre-trained foundation model using a Recursive Prompt Generator with residual injection, adding fewer than 1\% additional parameters. Recursion enables parameter-efficient scaling through iterative refinement, while residual injection preserves token alignment under dynamic reconfiguration, yielding robust adaptation to articulated objects. To further resolve structural ambiguities, we introduce $\mathcal{X}$-SGP, a multi-scale fusion module that adaptively integrates semantic and geometric cues, an aspect often overlooked by geometry-centric approaches. Experiments on synthetic and real benchmarks demonstrate state-of-the-art monocular articulated pose estimation without 3D supervision or auxiliary depth input. To the best of our knowledge, ReCAP is the first self-supervised framework to accomplish this task from a single image.
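The submission's implementation is not shown on this page, so purely as a rough illustration, here is a minimal PyTorch sketch of how a recursive prompt generator with residual injection might be structured. All names and choices here (RecursivePromptGenerator, num_prompts, num_steps, the cross-attention refiner) are hypothetical assumptions, not the authors' actual design.

```python
import torch
import torch.nn as nn

class RecursivePromptGenerator(nn.Module):
    """Hypothetical sketch: a small set of learnable prompt tokens is
    refined recursively against frozen backbone tokens, and each update
    is applied as a residual so the prompt tokens stay aligned with the
    token distribution the backbone expects."""

    def __init__(self, dim: int, num_prompts: int = 8, num_steps: int = 3):
        super().__init__()
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, dim))
        # Lightweight refiner shared across recursion steps, keeping the
        # added parameter count small relative to the frozen backbone.
        self.refine = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.num_steps = num_steps

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) image tokens from a frozen ViT-style backbone.
        B = tokens.size(0)
        p = self.prompts.expand(B, -1, -1)
        for _ in range(self.num_steps):
            # Cross-attend prompts to image tokens; the residual injection
            # (p + delta) refines rather than replaces the prompts.
            delta, _ = self.refine(self.norm(p), tokens, tokens)
            p = p + delta
        # Prepend refined prompts: downstream layers see [prompts | tokens].
        return torch.cat([p, tokens], dim=1)

# Usage with dummy ViT-style tokens (shapes are illustrative):
gen = RecursivePromptGenerator(dim=768)
x = torch.randn(2, 196, 768)
out = gen(x)  # (2, 196 + 8, 768)
```

Under these assumptions, sharing one refiner across recursion steps is what would keep the added parameters under the stated 1\% budget, while deeper refinement comes from more steps rather than more weights.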
Primary Area: applications to robotics, autonomy, planning
Submission Number: 6189