ROPES: Robotic Pose Estimation via Score-based Causal Representation Learning

Published: 19 Sept 2025 · Last Modified: 19 Sept 2025 · NeurIPS 2025 Workshop EWM · CC BY 4.0
Keywords: Robotics, Causal Representation Learning
Abstract: We introduce RObotic Pose Estimation via Score-Based Causal Representation Learning (ROPES), a framework for recovering robot pose from raw images without sample-level labels. Existing vision-based estimators achieve high accuracy but rely on supervision or fiducials, which limits robustness under domain shift and occlusion and hampers deployment at scale. ROPES adopts a generative view: images reflect latent factors such as geometry, lighting, background, and robot joints. The goal is to recover the controllable latent variables, i.e., those linked to actuation. Interventional Causal Representation Learning (CRL) theory establishes that comparing distributions induced by interventions enables identifiability. In robotics, such interventions arise naturally by commanding actuators of various joints and recording images under varied controls. ROPES learns a disentangled 6-dimensional representation of a robot arm's state via a three-stage pipeline: (i) compressing images with an autoencoder, (ii) contrasting across interventional domains to estimate score differences, and (iii) refining these into six structured variables, with the final stage regularized by score differences so that the estimated latents align with the true joint angles. In semi-synthetic manipulator experiments, ROPES recovers latent representations that are highly disentangled, strongly correlated with true joint angles, and stable across settings. Crucially, this is achieved by leveraging only distributional changes, without using a single pose label at any step. The paper concludes by outlining open challenges and positioning robot pose estimation as a near-practical testbed for measuring progress in CRL.
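The three-stage pipeline in the abstract can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the "autoencoder" is replaced by PCA, the learned score estimators by closed-form Gaussian scores, and the refinement stage by a simple per-intervention matching step; all data, shapes, and variable names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in data: flattened "images" collected in an observational domain
# and in one interventional domain per actuated joint (6 joints, as in ROPES).
n, d_img, n_joints = 200, 64, 6
obs = rng.normal(size=(n, d_img))
interventions = [rng.normal(loc=0.5, size=(n, d_img)) for _ in range(n_joints)]

# Stage (i): compress images. PCA serves as a linear stand-in for the
# autoencoder described in the abstract.
def fit_encoder(X, k):
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    W = Vt[:k].T                     # (d_img, k) projection
    return lambda Z: (Z - mu) @ W

encode = fit_encoder(np.vstack([obs] + interventions), k=8)

# Stage (ii): estimate score differences between each interventional domain
# and the observational one. Here each domain's score is the Gaussian score
# -(z - mu) @ inv(Sigma); a real system would learn these from data.
def gaussian_score_fn(X):
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-3 * np.eye(X.shape[1])
    P = np.linalg.inv(cov)
    return lambda z: -(z - mu) @ P

base_score = gaussian_score_fn(encode(obs))
score_diffs = [lambda z, s=gaussian_score_fn(encode(X)): s(z) - base_score(z)
               for X in interventions]

# Stage (iii): refine into structured variables. As a crude proxy for the
# regularized refinement network, associate each intervention with the latent
# coordinate whose score shifts most on average.
z = encode(obs)
diffs = np.stack([sd(z) for sd in score_diffs])        # (n_joints, n, 8)
directions = np.argmax(np.abs(diffs).mean(axis=1), axis=1)
print(directions.shape)  # one latent coordinate per intervened joint: (6,)
```

The sketch only mirrors the pipeline's structure: the key idea it preserves is that supervision comes entirely from cross-domain score differences, never from pose labels.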
Submission Number: 73