Implicit Ray Transformers for Multiview Remote Sensing Image Segmentation

Published: 01 Jan 2023 · Last Modified: 12 Nov 2024 · IEEE Trans. Geosci. Remote Sens. 2023 · CC BY-SA 4.0
Abstract: Mainstream convolutional neural network (CNN)-based approaches to remote sensing (RS) image semantic segmentation typically rely on massive amounts of labeled training data. Such a paradigm struggles with RS multiview scene segmentation when only a few views are labeled, because it does not consider the 3-D information within the scene. In this article, we propose the "implicit ray transformer (IRT)", based on implicit neural representation (INR), for RS scene semantic segmentation with sparse labels (only 5% of the images are labeled). We explore a new way of introducing multiview 3-D structure priors into the task to achieve accurate and view-consistent semantic segmentation. The proposed method follows a two-stage learning process. In the first stage, we optimize a neural field that encodes the color and 3-D structure of the RS scene from multiview images. In the second stage, we design a ray transformer that leverages the relations between the 3-D neural-field features and 2-D texture features to learn better semantic representations. Unlike previous methods that consider only 3-D priors or only 2-D features, we incorporate both by broadcasting CNN features to the point features sampled along each ray. To verify the effectiveness of the proposed method, we construct a challenging dataset containing six synthetic sub-datasets collected from the Carla platform and three real sub-datasets from Google Maps. Experiments show that the proposed method outperforms CNN-based methods and state-of-the-art INR-based segmentation methods on both quantitative and qualitative metrics. The ablation study shows that, given a limited number of fully annotated images, combining 3-D structure priors with 2-D texture significantly improves performance and effectively completes missing semantic information in novel views. Experiments also demonstrate that the proposed method yields geometry-consistent segmentation results under illumination and viewpoint changes. Our data and code will be made publicly available.
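To make the second-stage fusion concrete, below is a minimal PyTorch sketch of the idea the abstract describes: a 2-D CNN texture feature is broadcast to every point sampled along a ray, concatenated with the point-wise 3-D neural-field features, and the resulting tokens are processed by a transformer before being aggregated into a per-pixel semantic prediction. All module names, layer sizes, and the weight-based aggregation are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class RayTransformerSketch(nn.Module):
    """Hedged sketch of fusing 3-D point features with a broadcast 2-D
    CNN feature along a sampled ray. Dimensions are assumptions."""

    def __init__(self, feat_3d_dim=64, feat_2d_dim=64, n_heads=4, n_classes=10):
        super().__init__()
        d_model = feat_3d_dim + feat_2d_dim
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, point_feats, pixel_feat, weights):
        # point_feats: (R, S, feat_3d_dim) neural-field features of
        #              S sampled points on each of R rays (stage one).
        # pixel_feat:  (R, feat_2d_dim) CNN texture feature of the pixel
        #              each ray passes through.
        # weights:     (R, S) per-point rendering weights from stage one.
        R, S, _ = point_feats.shape
        # Broadcast the 2-D texture feature to every point along the ray.
        pixel_feat = pixel_feat.unsqueeze(1).expand(R, S, -1)
        tokens = torch.cat([point_feats, pixel_feat], dim=-1)  # (R, S, d_model)
        tokens = self.transformer(tokens)  # self-attention along each ray
        logits = self.head(tokens)         # (R, S, n_classes)
        # Reduce per-point logits to one prediction per ray/pixel using
        # the rendering weights (a common NeRF-style aggregation).
        return (weights.unsqueeze(-1) * logits).sum(dim=1)  # (R, n_classes)

# Hypothetical usage: 1024 rays, 32 samples per ray.
model = RayTransformerSketch()
out = model(torch.randn(1024, 32, 64), torch.randn(1024, 64),
            torch.softmax(torch.randn(1024, 32), dim=-1))
print(out.shape)  # torch.Size([1024, 10])
```

The broadcast-then-attend design lets every 3-D sample see the same 2-D texture evidence while the transformer decides, per ray, how to weigh samples against each other; whether the paper aggregates with rendering weights or another pooling is an assumption of this sketch.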