Abstract: 3D occupancy prediction (OCC) aims to estimate the semantic occupancy state of the surrounding environment, which is crucial for scene understanding and reconstruction in the real world. However, existing methods for 3D OCC mainly rely on surround-view camera images, and their performance remains insufficient in challenging scenarios such as low-light conditions. To this end, we propose a new multi-modal fusion network for 3D occupancy prediction, called FusionOcc, which fuses features of LiDAR point clouds and surround-view images. Our model fuses features of these two modalities in 2D and 3D space, respectively. By integrating the depth information from point clouds, a cross-modal fusion module is designed to predict a 2D dense depth map, enabling accurate depth estimation and a better transition of 2D image features into 3D space. In addition, features of voxelized point clouds are aligned and merged in 3D space with image features converted by a view-transformer. Experiments show that FusionOcc establishes a new state of the art on the Occ3D-nuScenes dataset, achieving a mIoU score of 35.94% (without visibility mask) and 56.62% (with visibility mask), an average improvement of 3.42% over the best previous method. Our work provides a new baseline for further research on multi-modal fusion for 3D occupancy prediction.
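To make the 2D fusion step concrete, the sketch below illustrates one plausible form of the cross-modal depth fusion described above: sparse LiDAR depth projected onto the image plane is encoded and combined with image features to predict a dense per-pixel depth distribution. This is a minimal, hypothetical sketch; the module name, channel sizes, and additive fusion are assumptions and not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class CrossModalDepthFusion(nn.Module):
    """Hypothetical sketch: fuse a sparse LiDAR depth map with image
    features to predict a dense categorical depth distribution."""
    def __init__(self, img_channels=256, depth_bins=88):
        super().__init__()
        # Encode the sparse LiDAR depth projected onto the image plane.
        self.depth_encoder = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, img_channels, 3, padding=1),
        )
        # Predict a depth distribution over D discrete bins per pixel.
        self.depth_head = nn.Conv2d(img_channels, depth_bins, 1)

    def forward(self, img_feat, sparse_depth):
        # img_feat:     (B, C, H, W) surround-view image features
        # sparse_depth: (B, 1, H, W) LiDAR points projected to the image
        fused = img_feat + self.depth_encoder(sparse_depth)   # simple additive fusion (assumed)
        return self.depth_head(fused).softmax(dim=1)          # (B, D, H, W) dense depth
```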
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: 3D occupancy prediction (OCC) aims to estimate the semantic occupancy state of the surrounding environment, which is crucial for scene understanding and reconstruction in the real world. OCC is now widely used in autonomous driving, providing detailed road conditions and a better driving experience. Existing methods for 3D OCC mainly rely on surround-view camera images, and their performance remains insufficient in challenging scenarios such as low-light conditions. Our work proposes a multi-modal fusion network for 3D occupancy prediction that fuses features of LiDAR point clouds and surround-view images. The model fuses features of these two modalities in both 2D and 3D space. For 2D fusion, camera images and the depth map derived from LiDAR point clouds are fused via a cross-modal module; this fusion helps generate an accurate, dense depth map, which in turn allows image features to be projected accurately into 3D space. For 3D fusion, image features are projected into 3D space via a view transformer, and features of voxelized point clouds are transformed into the vehicle coordinate system beforehand so that they align with the converted image features, as sketched below.
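The following sketch illustrates the 3D fusion step just described: once LiDAR voxel features and lifted image features share the same ego-vehicle voxel grid, they can be merged channel-wise. This is a hypothetical illustration under assumed channel sizes and a simple concatenation-plus-convolution merge; it is not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class VoxelFeatureFusion(nn.Module):
    """Hypothetical sketch: merge voxelized LiDAR features with image
    features lifted into the same ego-vehicle voxel grid."""
    def __init__(self, lidar_channels=64, img_channels=64, out_channels=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv3d(lidar_channels + img_channels, out_channels, 3, padding=1),
            nn.BatchNorm3d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, lidar_voxels, img_voxels):
        # Both tensors share the same voxel grid (B, C, Z, Y, X) in the
        # vehicle coordinate system, so they can be concatenated directly.
        return self.fuse(torch.cat([lidar_voxels, img_voxels], dim=1))
```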
Our work effectively preserves both the positional information of point clouds and the semantic information of images. For each voxel predicted by our model in 3D space, we can use the vehicle's mapping matrix to locate the corresponding pixel in the 2D image and the corresponding points in the point cloud, which preserves the interpretability of the fusion between the two modalities.
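As a small worked illustration of this voxel-to-pixel correspondence, the helper below maps a voxel center in the vehicle frame to a pixel in one camera image using an ego-to-camera extrinsic matrix and the camera intrinsics. The function name and arguments are hypothetical; it only sketches the standard projection that the mapping matrices enable.

```python
import numpy as np

def voxel_to_pixel(voxel_center_ego, ego_to_cam, intrinsics):
    """Hypothetical helper: map a voxel center (x, y, z) in the vehicle
    frame to a pixel (u, v) in one camera image."""
    p_ego = np.append(voxel_center_ego, 1.0)   # homogeneous coordinates
    p_cam = ego_to_cam @ p_ego                 # 4x4 ego-to-camera extrinsics
    if p_cam[2] <= 0:                          # voxel lies behind the camera
        return None
    uvw = intrinsics @ p_cam[:3]               # 3x3 camera intrinsics
    return uvw[0] / uvw[2], uvw[1] / uvw[2]    # pixel coordinates (u, v)
```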
Supplementary Material: zip
Submission Number: 3275