Leveraging Frozen Foundation Models and Multimodal Fusion for BEV Segmentation and Occupancy Prediction
Abstract: In Bird's Eye View (BEV) perception, significant emphasis is placed on deploying complex, high-performing model architectures and leveraging as many sensor modalities as possible to maximize performance. This paper investigates whether foundation models and multi-sensor deployments are essential
for enhancing BEV perception. We examine the relative importance of advanced feature extraction versus the
number of sensor modalities and assess whether foundation models can address feature extraction limitations
and reduce the need for extensive training data. Specifically, incorporating the self-supervised DINOv2
for feature extraction and Metric3Dv2 for depth estimation into the Lift-Splat-Shoot framework results in
a 7.4 IoU point increase in vehicle segmentation, representing a relative improvement of 22.4%, while
requiring only half the training data and iterations compared to the original model. Furthermore, using
Metric3Dv2’s depth maps as a pseudo-LiDAR point cloud within the Simple-BEV model improves IoU
by 2.9 points, marking a 6.1% relative increase compared to the Camera-only setup. Finally, we extend the
Gaussian-Splatting-based BEV perception models GaussianFormer and GaussianOcc through multimodal
deployment. The addition of LiDAR information in GaussianFormer results in a 9.4-point increase in mIoU, a
48.7% improvement over the Camera-only model, nearing state-of-the-art multimodal performance even with
limited LiDAR scans. In the self-supervised GaussianOcc model, incorporating LiDAR leads to a 0.36-point
increase in mIoU, representing a 3.6% improvement over the Camera-only model. This limited gain can
be attributed to the absence of LiDAR encoding and the self-supervised nature of the model. Overall, our
findings highlight the critical role of foundation models and multi-sensor integration in advancing BEV
perception. By leveraging sophisticated foundation models and multi-sensor deployment, we can further improve model performance and reduce data requirements, addressing key challenges in BEV perception.