Uncertainty-Guided Cross-Modal Distillation for Category-Level Object Pose Estimation

Published: 10 Nov 2025, Last Modified: 25 Mar 2026, OpenReview Archive Direct Upload, Readers: Everyone, License: CC BY 4.0
Abstract: Recent years have seen significant advances in category-level object pose estimation, largely driven by multimodal (RGB-D) approaches. Despite this success, depth-only methods remain widely adopted in practical applications because they are computationally cheaper and easier to deploy. However, they typically show a noticeable performance gap relative to multimodal methods. To bridge this gap, we propose a novel framework, Cross-Modal Uncertainty Distillation for Pose Estimation (CMUD-Pose), which transfers discriminative knowledge from an RGB-D teacher to a depth-only student network. Furthermore, to mitigate overfitting induced by the modality gap, we propose Cross-Modal Uncertainty Distillation (CMUD), which uses a learned uncertainty-aware weighting mechanism to adaptively assign importance to training samples. By incorporating uncertainty into the distillation process, CMUD allows the student model to focus selectively on reliable and transferable cross-modal features. Extensive experiments on the REAL275 and CAMERA25 benchmarks show that our method significantly improves the performance of depth-only pose estimation models.
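The abstract does not state the exact form of the CMUD loss. As a rough illustration of what uncertainty-weighted feature distillation can look like, the sketch below uses a common heteroscedastic formulation in which each sample carries a learned log-variance that scales its student-teacher feature mismatch. This is an assumption for illustration only, not the paper's confirmed objective; the function names and the per-sample `log_vars` input are hypothetical.

```python
import math


def sample_weights(log_vars):
    """Per-sample weights exp(-s_i): higher predicted uncertainty, lower weight."""
    return [math.exp(-s) for s in log_vars]


def uncertainty_weighted_distill_loss(student_feats, teacher_feats, log_vars):
    """Mean over samples of exp(-s_i) * ||f_student - f_teacher||^2 + s_i.

    Assumed form (not taken from the paper): s_i is a log-variance that a
    network would predict per sample. The "+ s_i" term penalizes the trivial
    solution of declaring every sample uncertain.
    """
    total = 0.0
    for f_s, f_t, s in zip(student_feats, teacher_feats, log_vars):
        # Squared L2 distance between student and teacher features.
        sq_err = sum((a - b) ** 2 for a, b in zip(f_s, f_t))
        total += math.exp(-s) * sq_err + s
    return total / len(log_vars)


# Toy features: the second sample disagrees strongly with the teacher.
student = [[1.0, 0.0], [0.0, 0.0]]
teacher = [[1.0, 0.0], [2.0, 0.0]]

uniform = uncertainty_weighted_distill_loss(student, teacher, [0.0, 0.0])
adaptive = uncertainty_weighted_distill_loss(student, teacher, [0.0, math.log(4.0)])
```

With zero log-variances the loss reduces to a plain mean squared feature distance. Raising the log-variance of the mismatched sample lowers its weight, so it contributes less to the total, which is the selective-focus behavior the abstract describes in general terms.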