Memory-Augmented Detection Transformer for Few-Shot Object Detection in Remote Sensing Imagery

Abdullah Azeem, Zhengzhou Li, Abubakar Siddique, Yuting Zhang, Dong Cao

Published: 01 Jan 2025, Last Modified: 06 Nov 2025 · IEEE Transactions on Geoscience and Remote Sensing · CC BY-SA 4.0
Abstract: Few-shot object detection (FSOD) in remote sensing (RS) faces a fundamental challenge in balancing contextual feature learning with representation stability. Current approaches either capture rich contextual relationships through multimodal architectures but suffer from catastrophic forgetting, or maintain stable feature representations through incremental learning frameworks at the cost of contextual understanding. This limitation is particularly acute in RS, where complex spatial-contextual dependencies form distinctive co-occurrence patterns that are important for accurate object detection. We present a memory-augmented detection transformer (MemDeT) that bridges this gap through three key modules: 1) a contextual layerwise fusion (CLF) module that progressively integrates visual and textual information across transformer layers through an adaptive attention mechanism and unified cross-modal fusion, enabling both fine-grained object feature extraction at lower layers and abstract contextual relationship learning at higher layers; 2) a unified episodic memory (UEM) that serves as a dynamic knowledge repository, employing a similarity-surprise-based update mechanism over a structured key-value memory to strategically retrieve and update relevant past experiences while preserving base knowledge through contextual scoring and dynamic retrieval; and 3) a memory-augmented decoder (MAD) that generates context-aware queries by combining current visual observations with accumulated contextual knowledge through memory-prioritized attention and progressive query refinement. Extensive experiments on the RS datasets NWPU VHR-10, DIOR, and iSAID demonstrate that MemDeT significantly outperforms state-of-the-art methods in both contextual understanding and knowledge retention.
Cross-dataset evaluation further validates MemDeT’s robust generalization capabilities, successfully transferring knowledge from DIOR to iSAID’s urban scenes, particularly in scenarios requiring strong contextual understanding with limited training examples.
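The unified episodic memory described above can be illustrated with a minimal sketch. This is not the paper's actual implementation: the class name, capacity, thresholds, cosine-similarity scoring, and lowest-score eviction policy are all illustrative assumptions used to show how a similarity-surprise write rule and dynamic top-k retrieval over a key-value store might fit together.

```python
import numpy as np

class UnifiedEpisodicMemory:
    """Illustrative sketch of a key-value episodic memory with a
    similarity-surprise update rule (assumed design, not the paper's code)."""

    def __init__(self, capacity=64, dim=8, surprise_threshold=0.5):
        self.capacity = capacity
        self.keys = np.empty((0, dim))    # contextual keys
        self.values = np.empty((0, dim))  # stored feature values
        self.scores = np.empty(0)         # contextual relevance scores
        self.surprise_threshold = surprise_threshold

    def _similarity(self, query):
        # Cosine similarity between the query and every stored key.
        q = query / (np.linalg.norm(query) + 1e-8)
        k = self.keys / (np.linalg.norm(self.keys, axis=1, keepdims=True) + 1e-8)
        return k @ q

    def retrieve(self, query, top_k=4):
        """Dynamic retrieval: return the top-k most similar stored values."""
        if len(self.keys) == 0:
            return np.empty((0, self.values.shape[1]))
        sims = self._similarity(query)
        idx = np.argsort(sims)[::-1][:top_k]
        return self.values[idx]

    def update(self, key, value):
        """Similarity-surprise write: store only sufficiently novel entries;
        evict the lowest-scoring slot when the memory is full."""
        if len(self.keys) > 0:
            surprise = 1.0 - self._similarity(key).max()
            if surprise < self.surprise_threshold:
                return False  # too similar to stored experience; skip write
        else:
            surprise = 1.0
        if len(self.keys) >= self.capacity:
            evict = np.argmin(self.scores)  # drop least-relevant entry
            self.keys = np.delete(self.keys, evict, axis=0)
            self.values = np.delete(self.values, evict, axis=0)
            self.scores = np.delete(self.scores, evict)
        self.keys = np.vstack([self.keys, key])
        self.values = np.vstack([self.values, value])
        self.scores = np.append(self.scores, surprise)
        return True
```

Under this design, near-duplicate episodes are rejected at write time (low surprise), so the memory preserves diverse base knowledge while still admitting novel context; retrieval then supplies the decoder with the most relevant stored experiences for query generation.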