Abstract: Current encoder-decoder methods for remote sensing image captioning (RSIC) avoid fine-grained structural representation of objects due to the lack of prominent encoding frameworks. This paper proposes a novel structural representative network (SRN) that acquires fine-grained structures of remote sensing images (RSI) to generate semantically meaningful captions. We first apply the SRN on top of the final layers of a convolutional neural network (CNN) to obtain spatially transformed RSI features. A multi-stage decoder then operates on the features extracted by the SRN to produce fine-grained, meaningful captions. The efficacy of the proposed method is demonstrated on two RSIC datasets, i.e., the Sydney-Captions dataset and the UCM-Captions dataset.
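The abstract describes a three-part pipeline: a CNN encoder, an SRN that spatially transforms the encoder features, and a multi-stage decoder. The sketch below illustrates one plausible reading of that pipeline; it is not the paper's implementation. The SRN here is assumed to be a spatial-transformer-style module, the decoder is assumed to be a two-stage LSTM, and all module names, dimensions, and hyperparameters (SRN, MultiStageDecoder, the hidden sizes, the ResNet-50 backbone) are illustrative assumptions.

```python
# A minimal sketch of the CNN -> SRN -> multi-stage decoder pipeline
# described in the abstract. All architectural details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class SRN(nn.Module):
    """Hypothetical structural representative network: predicts an affine
    transform from pooled CNN features and resamples the feature map,
    yielding spatially transformed RSI features."""
    def __init__(self, channels: int):
        super().__init__()
        self.loc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 32), nn.ReLU(),
            nn.Linear(32, 6),  # 2x3 affine transform parameters
        )
        # Initialize the localization head to the identity transform.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        theta = self.loc(feats).view(-1, 2, 3)
        grid = F.affine_grid(theta, feats.size(), align_corners=False)
        return F.grid_sample(feats, grid, align_corners=False)

class MultiStageDecoder(nn.Module):
    """Hypothetical two-stage LSTM decoder: stage 1 drafts a hidden state
    from the word embedding and image context, stage 2 refines it into
    word logits."""
    def __init__(self, feat_dim: int, embed_dim: int, hidden: int, vocab: int):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed_dim)
        self.stage1 = nn.LSTMCell(embed_dim + feat_dim, hidden)
        self.stage2 = nn.LSTMCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, feats: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        ctx = feats.mean(dim=(2, 3))  # global context vector from SRN features
        hidden = self.out.in_features
        h1 = c1 = h2 = c2 = feats.new_zeros(feats.size(0), hidden)
        logits = []
        for t in range(tokens.size(1)):  # teacher forcing over caption tokens
            x = torch.cat([self.embed(tokens[:, t]), ctx], dim=1)
            h1, c1 = self.stage1(x, (h1, c1))
            h2, c2 = self.stage2(h1, (h2, c2))
            logits.append(self.out(h2))
        return torch.stack(logits, dim=1)

# Usage: ResNet-50 (minus pooling/FC) as the CNN encoder, SRN on its
# final feature map, then the two-stage decoder.
cnn = nn.Sequential(*list(models.resnet50(weights=None).children())[:-2])
image = torch.randn(2, 3, 224, 224)          # a batch of two RSI inputs
feats = SRN(2048)(cnn(image))                # (2, 2048, 7, 7) transformed features
tokens = torch.randint(0, 1000, (2, 12))     # dummy caption token ids
logits = MultiStageDecoder(2048, 256, 512, 1000)(feats, tokens)
print(logits.shape)                          # torch.Size([2, 12, 1000])
```

Initializing the SRN's affine parameters to the identity is a common choice for spatial-transformer-style modules, so training starts from the untransformed feature map.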