SfmOcc: Vision-Based 3D Semantic Occupancy Prediction in Urban Environments

Rodrigo Marcuzzi; Lucas Nunes; Elias Marks; Louis Wiesmann; Thomas Läbe; Jens Behley; Cyrill Stachniss

Published: 01 Jan 2025, Last Modified: 14 May 2025 | IEEE Robotics Autom. Lett. 2025 | CC BY-SA 4.0

Abstract: Semantic scene understanding is crucial for autonomous systems, and 3D semantic occupancy prediction is a key task because it provides geometric and, optionally, semantic information about the vehicle's surroundings. Most existing vision-based approaches to occupancy estimation rely on 3D voxel labels or segmented LiDAR point clouds for supervision, which restricts them to settings where a 3D LiDAR sensor is available or where costly voxel labeling is affordable. Other approaches train only on images, but they usually supervise with just a few consecutive images and optimize for proxy tasks such as volume reconstruction or depth prediction. In this paper, we propose a novel method for semantic occupancy prediction that uses only vision data, including for supervision. We leverage all available training images of a sequence and use bundle adjustment to align the images and estimate camera poses, from which we then obtain depth images. We compute semantic maps with a pre-trained open-vocabulary image model and generate occupancy pseudo labels to explicitly optimize for the 3D semantic occupancy prediction task. Without any manual or LiDAR-based labels, our approach predicts full 3D occupancy voxel grids and achieves state-of-the-art results for 3D occupancy prediction among methods trained without labels.
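The abstract does not spell out how depth images and semantic maps are turned into occupancy pseudo labels. As a rough illustration only, and not the authors' implementation, the sketch below shows one generic way to do it: back-project a depth image through a pinhole camera model, transform the points with the estimated camera pose, and write the per-pixel class ids into a voxel grid. The function name `backproject_to_voxels`, all parameter names, the `unknown_label` convention, and the last-write-wins handling of voxels hit by several pixels are assumptions made for this example.

```python
import numpy as np


def backproject_to_voxels(depth, semantics, K, T_world_cam,
                          grid_min, voxel_size, grid_shape,
                          unknown_label=-1):
    """Hypothetical helper: label voxels from one depth + semantic image.

    depth:        (H, W) metric depth, 0 marks invalid pixels
    semantics:    (H, W) integer class id per pixel
    K:            (3, 3) pinhole intrinsics
    T_world_cam:  (4, 4) camera-to-world pose
    """
    h, w = depth.shape
    # Pixel coordinates: u indexes columns, v indexes rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0

    # Pinhole back-projection into the camera frame.
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=0)  # 4 x N

    # Move points into the world frame using the estimated pose.
    pts_world = (T_world_cam @ pts_cam)[:3].T  # N x 3

    # Convert metric coordinates to integer voxel indices.
    idx = np.floor((pts_world - np.asarray(grid_min)) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    idx = idx[inside]
    labels = semantics[valid][inside]

    # Unobserved voxels stay "unknown"; repeated hits are last-write-wins,
    # which is a simplification (a real pipeline might vote per voxel).
    grid = np.full(grid_shape, unknown_label, dtype=np.int64)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = labels
    return grid
```

Calling this once per training image and merging the resulting grids would give a sparse set of labeled surface voxels. A complete pseudo-labeling scheme would also need to mark free space along each camera ray, which this sketch leaves out.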