Feature-level Fusion of 2D Images and 3D LiDAR Point Clouds for Semantic Segmentation

Published: 15 Oct 2025, Last Modified: 10 Nov 2025
Venue: BNAIC/BeNeLearn 2025 Poster
License: CC BY 4.0
Track: Type A (Regular Papers)
Keywords: Semantic segmentation, 2D-3D fusion, Feature fusion, Multimodal fusion
Abstract: Semantic segmentation is an essential task in autonomous systems, with applications in driving, robot navigation, and medical diagnosis. Although established methods exist for 2D segmentation with convolutional neural networks (CNNs) and for 3D segmentation with dedicated 3D models, the complementary nature of 2D and 3D data should not be overlooked. This research explores the multimodal fusion of 2D images and 3D LiDAR point clouds for semantic segmentation in both structured and unstructured environments. Building on the DeepViewAgg framework, we investigate how feature-level fusion affects semantic segmentation compared with models that use only 2D or only 3D data. We train a model for each modality and compare its performance with that of the fused model. On KITTI-360, fusion reaches a mean IoU (mIoU) of 57.53, compared with 54.20 (3D-only) and 56.70 (2D-only), with the most notable improvement in thin classes such as "pole" (+21.3 points). On the WildScenes natural-environment dataset, fusion reaches 33.0 mIoU, surpassing the 2D and 3D baseline models by 5.0 points. These results show that multimodal fusion can outperform single-modality approaches, especially for scene elements that benefit from combined 2D-3D cues.
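The results above are reported as mean IoU (mIoU), the average over classes of the per-class intersection-over-union. The following is a minimal sketch of that standard definition, not the evaluation code used in the paper. The function names and the toy labels are hypothetical, and skipping classes that never occur is an assumption, since the abstract does not state the exact protocol (for example, how ignored labels are treated).

```python
from typing import List, Optional, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count label pairs: cm[t][p] = number of points with truth t predicted as p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_iou(cm: List[List[int]]) -> List[Optional[float]]:
    """IoU_c = TP / (TP + FP + FN); None for a class absent from truth and prediction."""
    n = len(cm)
    ious: List[Optional[float]] = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # truth c, predicted as another class
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, truth is another class
        denom = tp + fp + fn
        ious.append(tp / denom if denom > 0 else None)
    return ious


def mean_iou(cm: List[List[int]]) -> float:
    """Mean IoU in percent, averaged over classes that occur (an assumption)."""
    valid = [v for v in per_class_iou(cm) if v is not None]
    return 100.0 * sum(valid) / len(valid) if valid else 0.0


# Toy example with three hypothetical classes and six labelled points.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(per_class_iou(cm))
print(mean_iou(cm))
```

Under this definition, the KITTI-360 figures quoted above correspond to a gain of 3.33 mIoU points over the 3D-only model and 0.83 points over the 2D-only model.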
Serve As Reviewer: ~Muhammad_Shoaib_Sarwar1
Submission Number: 23