Leveraging Frozen Foundation Models and Multimodal Fusion for BEV Segmentation and Occupancy Prediction
Abstract: In Bird's Eye View (BEV) perception, significant emphasis is placed on deploying complex, high-performing model architectures and leveraging as many sensor modalities as possible to maximize performance. This paper investigates whether foundation models and multi-sensor deployments are essential
for enhancing BEV perception. We examine the relative importance of advanced feature extraction versus the
number of sensor modalities and assess whether foundation models can address feature extraction limitations
and reduce the need for extensive training data. Specifically, incorporating the self-supervised DINOv2
for feature extraction and Metric3Dv2 for depth estimation into the Lift-Splat-Shoot framework results in
a 7.4 IoU point increase in vehicle segmentation, representing a relative improvement of 22.4%, while
requiring only half the training data and iterations compared to the original model. Furthermore, using
Metric3Dv2’s depth maps as a pseudo-LiDAR point cloud within the Simple-BEV model improves IoU
by 2.9 points, marking a 6.1% relative increase compared to the Camera-only setup. Finally, we extend the
Gaussian-Splatting-based BEV perception models GaussianFormer and GaussianOcc through multimodal
deployment. The addition of LiDAR information in GaussianFormer results in a 9.4-point increase in mIoU, a
48.7% improvement over the Camera-only model, nearing state-of-the-art multimodal performance even with
limited LiDAR scans. In the self-supervised GaussianOcc model, incorporating LiDAR leads to a 0.36-point
increase in mIoU, representing a 3.6% improvement over the Camera-only model. This limited gain can
be attributed to the absence of LiDAR encoding and the self-supervised nature of the model. Overall, our
findings highlight the critical role of foundation models and multi-sensor integration in advancing BEV
perception. By leveraging sophisticated foundation models and multi-sensor deployment, we can further improve model performance and reduce data requirements, addressing key challenges in BEV perception.