Keywords: Occupancy Estimation, Open Vocabulary, Volume Rendering, CLIP
TL;DR: Open-vocabulary occupancy estimation by distilling CLIP features into a 3D model via differentiable volume rendering.
Abstract: The 3D occupancy estimation task has recently become an important challenge in vision-based autonomous driving.
However, most existing camera-based methods rely on costly 3D voxel labels or LiDAR scans for training, limiting their practicality and scalability.
Moreover, most methods are restricted to a predefined set of classes that they can detect.
In this work, we present LangOcc, a novel approach for open-vocabulary occupancy estimation that is trained using only camera images and can detect arbitrary semantics via vision-language alignment.
In particular, we distill the knowledge of the strong vision-language-aligned encoder CLIP into a 3D occupancy model via differentiable volume rendering.
Our model estimates vision-language aligned features in a 3D voxel grid using only images.
It is trained in a weakly supervised manner by rendering our estimates back to 2D space, where the features can easily be aligned with CLIP.
This training mechanism automatically supervises the scene geometry, enabling a straightforward yet powerful training method without any explicit geometry supervision.
LangOcc outperforms LiDAR-supervised competitors on open-vocabulary occupancy estimation by a large margin ($+4.3\%$), achieving an mAP of $22.7$ while relying solely on vision-based training.
We also achieve an mIoU of $11.84$ on the Occ3D-nuScenes dataset, surpassing previous vision-only semantic occupancy estimation methods ($+1.71\%$), despite not being limited to a specific set of categories.
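To make the rendering-based distillation described in the abstract more concrete, below is a minimal PyTorch sketch of rendering a vision-language feature volume back to 2D and aligning it with per-pixel CLIP features. This is not the authors' implementation: the grid resolution, feature dimension, random ray samples, and random stand-ins for CLIP features are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes (not from the paper): a small feature volume and a few rays.
C = 64                    # feature dimension, stand-in for the CLIP embedding size
X, Y, Z = 32, 32, 8       # voxel grid resolution
N_RAYS, N_SAMPLES = 1024, 48

# Predicted 3D occupancy model outputs: per-voxel density and language-aligned features.
density = torch.rand(1, 1, Z, Y, X, requires_grad=True)    # non-negative density
features = torch.randn(1, C, Z, Y, X, requires_grad=True)  # vision-language features

# Sample points along camera rays in normalized grid coordinates [-1, 1]^3.
# In practice these come from camera intrinsics/extrinsics; here they are random.
points = torch.rand(1, N_RAYS, N_SAMPLES, 1, 3) * 2 - 1
deltas = torch.full((1, N_RAYS, N_SAMPLES), 0.05)           # step size along each ray

# Trilinear interpolation of density and features at the sampled points.
sigma = F.grid_sample(density, points, align_corners=True).view(1, N_RAYS, N_SAMPLES)
feats = F.grid_sample(features, points, align_corners=True).view(1, C, N_RAYS, N_SAMPLES)

# Standard volume-rendering weights: alpha compositing with accumulated transmittance.
alpha = 1.0 - torch.exp(-sigma.clamp(min=0) * deltas)
trans = torch.cumprod(
    torch.cat([torch.ones_like(alpha[..., :1]), 1 - alpha + 1e-10], dim=-1), dim=-1
)[..., :-1]
weights = alpha * trans                                      # (1, N_RAYS, N_SAMPLES)

# Render one language-aligned feature per ray by compositing features along the ray.
rendered = (weights.unsqueeze(1) * feats).sum(dim=-1)        # (1, C, N_RAYS)

# Weak supervision: align rendered features with per-pixel CLIP features
# (random stand-ins here) via a cosine-similarity loss.
clip_targets = torch.randn(1, C, N_RAYS)
loss = 1 - F.cosine_similarity(rendered, clip_targets, dim=1).mean()
loss.backward()  # gradients reach both density (geometry) and features (semantics)
```

The single rendering loss drives both outputs, which is how geometry can be learned without explicit 3D supervision in this kind of setup: only voxels with appropriate density contribute the right features to the rendered view.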
Supplementary Material: zip
Submission Number: 202