Abstract: Recently, self-supervised 6D object pose estimation, where synthetic images with object poses (sometimes jointly with unannotated real images) are used for training, has attracted much attention in computer vision. Typical works in the literature employ a time-consuming differentiable renderer for object pose prediction at the training stage, so that (i) their performance on real images is generally limited due to the domain gap between the rendered and real images, and (ii) their training process is computationally expensive.
To address the two problems, we propose a novel
Network for Self-supervised Monocular Object pose estimation
by utilizing the predicted Camera poses from unannotated
real images, called SMOC-Net. The proposed
network is designed under a knowledge distillation framework,
consisting of a teacher model and a student model.
The teacher model contains a backbone estimation module
for initial object pose estimation, and an object pose refiner
for refining the initial object poses using a geometric constraint
(called the relative-pose constraint) derived from relative
camera poses. The student model gains knowledge for
object pose estimation from the teacher model by imposing
the relative-pose constraint. Thanks to the relative-pose
constraint, SMOC-Net can not only narrow the domain gap between synthetic and real data but also reduce the training cost. Experimental results on two public datasets demonstrate that SMOC-Net outperforms several state-of-the-art methods by a large margin while requiring much less
training time than the differentiable-renderer-based methods.
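To make the geometric idea behind the relative-pose constraint concrete, the following minimal Python/NumPy sketch (our illustration, not code from the paper; the function name and residual form are hypothetical) measures how far two predicted object poses deviate from the consistency implied by a relative camera pose between two unannotated real images:

    import numpy as np

    def relative_pose_residual(P_i, P_j, T_ij):
        # Hypothetical sketch of the relative-pose consistency idea:
        # for a static object seen in two real images i and j, the
        # object-to-camera pose predicted in frame j should agree with
        # the frame-i prediction transported by the relative camera
        # pose T_ij (all poses are 4x4 homogeneous matrices).
        P_j_from_i = T_ij @ P_i
        # Rotation discrepancy as a geodesic angle (radians).
        R_err = P_j[:3, :3].T @ P_j_from_i[:3, :3]
        angle = np.arccos(np.clip((np.trace(R_err) - 1.0) / 2.0, -1.0, 1.0))
        # Translation discrepancy (in the same units as the poses).
        trans = np.linalg.norm(P_j[:3, 3] - P_j_from_i[:3, 3])
        return angle, trans

Driving such a residual toward zero over pairs of real images constrains the predicted object poses without requiring ground-truth annotations or a differentiable renderer; the exact loss used by SMOC-Net may differ from this sketch.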