Label-Efficient LiDAR Scene Understanding with 2D-3D Vision Transformer Adapters

Published: 18 Apr 2025, Last Modified: 13 May 2025 · ICRA 2025 FMNS Spotlight · CC BY 4.0
Keywords: Semantic scene understanding, Computer vision for transportation, Deep learning for visual perception
TL;DR: BALViT is a novel approach that leverages frozen vision foundation models as amodal feature encoders, integrating range-view and bird’s-eye-view LiDAR encoding to enable a label-efficient LiDAR semantic segmentation network.
Abstract: LiDAR semantic segmentation pre-training is hindered by the lack of large, diverse datasets. Moreover, most point cloud segmentation architectures incorporate custom network layers, limiting the transferability of advances from vision-based architectures. Inspired by recent advances in universal foundation models, we propose BALViT, a novel approach that leverages frozen vision foundation models as amodal feature encoders for learning strong LiDAR encoders. Specifically, BALViT incorporates both range-view and bird’s-eye-view LiDAR encoding mechanisms, which we combine through 3D positional encoding. While the range-view features are processed through a frozen image backbone, our bird’s-eye-view branch enhances them through multiple cross-attention interactions. In this manner, we continuously enrich the vision network with domain-dependent knowledge, resulting in a strong LiDAR encoding mechanism with minimal parameter updates. Extensive evaluations of BALViT on the SemanticKITTI and nuScenes benchmarks demonstrate that it outperforms state-of-the-art methods in small-data regimes. We make the code and models publicly available at http://balvit.cs.uni-freiburg.de.
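The core fusion step described above, where the bird’s-eye-view branch queries range-view features via cross-attention, can be sketched as follows. This is a minimal, illustrative NumPy sketch and not the authors’ implementation: the projection weights, token counts, and dimensions here are all hypothetical placeholders (in the actual model these would be learned adapter parameters attached to a frozen ViT backbone).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(bev_tokens, rv_tokens, d):
    """BEV tokens (queries) attend to range-view tokens (keys/values).

    Hypothetical single-head sketch; real adapters use learned,
    multi-head projections inside a transformer block.
    """
    rng = np.random.default_rng(0)  # stand-in for learned weights
    Wq = rng.standard_normal((bev_tokens.shape[-1], d))
    Wk = rng.standard_normal((rv_tokens.shape[-1], d))
    Wv = rng.standard_normal((rv_tokens.shape[-1], d))

    Q = bev_tokens @ Wq   # queries from the BEV branch
    K = rv_tokens @ Wk    # keys from range-view features
    V = rv_tokens @ Wv    # values from range-view features

    A = softmax(Q @ K.T / np.sqrt(d))  # (num_bev, num_rv) attention map
    return A @ V          # BEV tokens enriched with range-view context

# Toy shapes: 16 BEV tokens, 32 range-view tokens, feature dim 8.
bev = np.random.default_rng(1).standard_normal((16, 8))
rv = np.random.default_rng(2).standard_normal((32, 8))
out = cross_attention(bev, rv, d=8)
print(out.shape)  # (16, 8)
```

Because only the small adapter projections would be trained while the image backbone stays frozen, repeated cross-attention blocks of this form inject LiDAR-specific (domain-dependent) knowledge with few parameter updates.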
Supplementary Material: pdf
Submission Number: 14