Interactive Concept Network Enhanced Transformer for Remote Sensing Image Captioning

Cheng Zhang, Zhongle Ren, Biao Hou, Jianhua Meng, Weibin Li, Licheng Jiao

Published: 2025, Last Modified: 15 Apr 2025. IEEE Trans. Geosci. Remote. Sens. 2025. License: CC BY-SA 4.0

Abstract: Remote sensing image captioning advances remote sensing image understanding through natural language generation. However, because remote sensing images cover large areas and contain abundant information, it is difficult to generate accurate semantic descriptions of crucial objects and their relationships. To address this difficulty, this article proposes a novel interactive concept network enhanced transformer (ICNET) for remote sensing image captioning. First, a local and global feature extraction module extracts multilevel visual features. To comprehensively capture key objects in the local features, a concept mapping network (CMN) is constructed to project multiscale local features onto high-level semantic concepts of the objects. The CMN integrates the relevant feature vectors in the visual feature mapping into multiple relatively independent word features, thus bridging the gap between visual features and semantic concepts. Subsequently, a global feature enhancement (GFE) module is introduced to sharpen the discrimination of global relationships and filter out irrelevant content. Finally, to aggregate semantic concepts and global features, a transformer equipped with a concept interaction module (CIM) is designed to facilitate feature alignment and generate captions with correct categories and relationships. Experimental results on three remote sensing image captioning datasets demonstrate the superiority of the proposed method.
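
The abstract does not specify how the CMN maps local features to concepts. As a rough illustration of the general idea of pooling many local feature vectors into a few relatively independent concept features, the sketch below uses scaled dot-product attention with one query vector per concept. The attention mechanism, the function name `map_to_concepts`, the `concept_queries` parameter, and all numeric values are assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def map_to_concepts(local_features, concept_queries):
    """Pool local feature vectors into one feature per concept query.

    Hypothetical stand-in for a concept mapping step: each concept query
    attends over all local features and returns their weighted average.
    """
    dim = len(concept_queries[0])
    scale = math.sqrt(dim)
    concept_features = []
    for query in concept_queries:
        # Attention weights: how relevant each local feature is to this concept.
        weights = softmax([dot(query, feat) / scale for feat in local_features])
        # Weighted average of the local features, one value per dimension.
        pooled = [
            sum(w * feat[d] for w, feat in zip(weights, local_features))
            for d in range(dim)
        ]
        concept_features.append(pooled)
    return concept_features


# Toy data (assumed): three 2-D local features and two concept queries.
local_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
concept_queries = [[4.0, 0.0], [0.0, 4.0]]
concepts = map_to_concepts(local_features, concept_queries)
```

In this toy setup each concept feature is dominated by the local features most similar to its query, which is the sense in which the pooled vectors are "relatively independent". In a trained model the queries would be learned rather than fixed by hand.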