Abstract: Remote sensing multiview image segmentation is essential for achieving accurate and consistent stereoscopic perception of target scenes. This task involves processing RGB images from multiple viewpoints to generate high-accuracy, view-consistent semantic segmentation across all views. Traditional training-based methods struggle to maintain cross-view consistency, while optimization-driven approaches using implicit neural networks improve view consistency but suffer from slow parameter optimization and inference. To overcome these limitations, we propose a novel Gaussian splatting-based semantic segmentation framework. Our method efficiently projects the color attributes and semantic features of 3-D Gaussians onto the image plane, enabling the simultaneous generation of both RGB images and segmentation outputs. By leveraging explicit spatial structures and a splatting rendering strategy, our approach significantly enhances optimization efficiency and rendering speed. In addition, we incorporate SAM2 to generate pseudo-labels for boundary regions, addressing the lack of supervision in sparsely labeled views (e.g., 3%). To further enforce cross-view consistency and feature coherence of 3-D Gaussians, we introduce a two-level aggregation loss that operates at both the 2-D feature map and 3-D spatial levels. Extensive experiments across nine datasets demonstrate the superiority of our method, which achieves competitive segmentation quality with limited supervisory views. Notably, our approach reduces rendering (inference) time by 90% while improving the average mean intersection over union (mIoU) by up to 3.5%.
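The simultaneous rendering of RGB and segmentation outputs described in the abstract can be illustrated with a minimal per-pixel sketch. It assumes the standard front-to-back alpha compositing used in Gaussian splatting, applied with the same blending weights to both a color vector and a semantic feature vector. The function name `composite_pixel`, the tuple layout, and the toy values are hypothetical and are not taken from the paper, whose exact formulation may differ.

```python
# Illustrative sketch only: front-to-back alpha compositing for ONE pixel.
# Assumption: each Gaussian overlapping the pixel contributes an opacity
# (alpha), an RGB color, and a semantic feature vector, and the list is
# already sorted from nearest to farthest.

def composite_pixel(gaussians):
    """Blend color and semantic features with shared compositing weights.

    gaussians: list of (alpha, color, feature) tuples, sorted front to back.
    Returns (color, feature, remaining_transmittance).
    """
    if not gaussians:
        return [0.0, 0.0, 0.0], [], 1.0

    color = [0.0, 0.0, 0.0]
    feature = [0.0] * len(gaussians[0][2])
    transmittance = 1.0  # fraction of light not yet absorbed

    for alpha, c, f in gaussians:
        weight = alpha * transmittance  # contribution of this Gaussian
        color = [acc + weight * ch for acc, ch in zip(color, c)]
        feature = [acc + weight * v for acc, v in zip(feature, f)]
        transmittance *= 1.0 - alpha

    return color, feature, transmittance


# Toy example: a red Gaussian (class 0) in front of a blue one (class 1).
pixel = [
    (0.5, [1.0, 0.0, 0.0], [1.0, 0.0]),
    (0.5, [0.0, 0.0, 1.0], [0.0, 1.0]),
]
rgb, sem, t_left = composite_pixel(pixel)
predicted_class = max(range(len(sem)), key=lambda k: sem[k])
```

Because the color and the feature vector share one pass and one set of weights, the segmentation output comes at little extra cost over RGB rendering in this sketch; a real renderer would perform this blending in parallel on the GPU.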
External IDs: doi:10.1109/tgrs.2025.3558217