Abstract: Self-supervised models have recently achieved no-
table advancements, particularly in the domain of semantic
occupancy prediction. These models utilize sophisticated loss
computation strategies to compensate for the absence of ground-
truth labels. For instance, techniques such as novel view
synthesis, cross-view rendering, and depth estimation have been
explored to address the issue of semantic and depth ambiguity.
However, such techniques typically incur high computational
costs and memory usage during the training stage, especially
in the case of novel view synthesis. To mitigate these is-
sues, we propose 3D pseudo-ground-truth labels generated by
the foundation models Grounded-SAM and Metric3Dv2, and
harness temporal information for label densification. Our 3D
pseudo-labels can be easily integrated into existing models,
which yields substantial performance improvements: when applied
to the OccNeRF model, mIoU increases by 45%, from 9.73 to
14.09. This stands in contrast to earlier advances in the field,
which are often not readily transferable to other architectures.
Additionally, we propose a
streamlined model, EasyOcc, which achieves 13.86 mIoU while
learning solely from our labels, avoiding the complex rendering
strategies mentioned above. Furthermore, our
method enables models to attain state-of-the-art performance
when evaluated on the full scene without applying the camera
mask, with EasyOcc achieving 7.71 mIoU, outperforming the
previous best model by 31%. These findings highlight the
critical importance of foundation models, temporal context, and
the choice of loss computation space in self-supervised learning
for comprehensive scene understanding.