Uncertainty-Guided Cross-Modal Distillation for Category-Level Object Pose Estimation
Abstract: Recent years have seen significant advancements in
category-level object pose estimation, largely driven by multimodal
(RGB-D) approaches. Despite their success, depth-only methods
remain widely adopted in practical applications due to their superior computational efficiency and ease of deployment. However,
these methods typically suffer from a noticeable performance gap
compared to multimodal methods. To bridge this gap, we propose
a novel framework, Cross-Modal Uncertainty Distillation for Pose
Estimation (CMUD-Pose), which transfers discriminative knowledge from an RGB-D teacher to a depth-only student network. Furthermore, to mitigate overfitting induced by the modality gap, we
propose Cross-Modal Uncertainty Distillation (CMUD), which uses a learned uncertainty-aware weighting mechanism to adaptively assign importance to training samples. By incorporating
uncertainty into the distillation process, CMUD allows the student
model to focus selectively on reliable and transferable cross-modal
features. Extensive experiments on the REAL275 and CAMERA25
benchmarks show that our method significantly improves the
performance of depth-only pose estimation models.
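
To make the idea of uncertainty-weighted distillation concrete, the following is a minimal sketch, not the formulation used in CMUD-Pose, which the abstract does not specify. It assumes a common heteroscedastic-style weighting in which each training sample carries a predicted log-variance: the squared distance between student and teacher features is scaled by exp(-log_var), and log_var is added back as a regularizer so the model cannot mark every sample as uncertain. The function name, argument names, and the loss form are all illustrative assumptions.

```python
import numpy as np


def uncertainty_weighted_distill_loss(student_feat, teacher_feat, log_var):
    """Hypothetical uncertainty-weighted feature distillation loss.

    student_feat, teacher_feat: arrays of shape (batch, dim).
    log_var: array of shape (batch,), a per-sample log-variance that a
        real system would predict with a small learned head (assumption).
    """
    student_feat = np.asarray(student_feat, dtype=float)
    teacher_feat = np.asarray(teacher_feat, dtype=float)
    log_var = np.asarray(log_var, dtype=float)

    # Per-sample squared distance between student and teacher features.
    sq_err = np.sum((student_feat - teacher_feat) ** 2, axis=1)

    # exp(-log_var) down-weights samples with high predicted uncertainty;
    # the "+ log_var" term penalizes inflating uncertainty everywhere.
    per_sample = np.exp(-log_var) * sq_err + log_var
    return float(per_sample.mean())


# Two samples: the first disagrees with the teacher, the second matches it.
student = [[1.0, 0.0], [0.0, 0.0]]
teacher = [[0.0, 0.0], [0.0, 0.0]]
loss_uniform = uncertainty_weighted_distill_loss(student, teacher, [0.0, 0.0])
```

With zero log-variance for every sample the loss reduces to the plain mean squared feature distance. For a sample whose teacher features are hard to match from depth alone, a larger log-variance reduces its contribution, which is one way the selective focus on reliable cross-modal features could be realized.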