Abstract: Few-shot object detection (FSOD) has received considerable attention due to the difficulty and time cost of labeling objects. Recent studies achieve excellent performance in natural scenes by using only a few instances of novel classes to fine-tune the last prediction layer of a model well trained on plentiful base data. However, compared with natural-scene objects, which have a single orientation and limited size variation, the orientation and size of objects in remote sensing images (RSIs) vary greatly. Methods designed for natural scenes therefore cannot be directly applied to RSIs. In this article, we first propose a strong baseline for RSIs. It fine-tunes all detector components that act on high-level features and effectively improves performance on novel classes. Further analyzing the results of this baseline, we find that the error on novel classes is mainly concentrated in classification: novel classes are misclassified as confusable base classes or as background, because it is difficult to extract generalized information from limited instances. As is well known, text-modal knowledge can concisely summarize the generalized and unique characteristics of categories. Thus, we introduce a text-modal description for each category and propose an FSOD method guided by TExt-MOdal knowledge, called TEMO. Specifically, a text-modal knowledge extractor and a cross-modal assembly module are proposed to extract text features and fuse them into the visual-modal features. The fused features greatly reduce the classification confusion of novel classes. Furthermore, we introduce a mask strategy and a separation loss to avoid overfitting and ambiguity of the text-modal features. Experimental results on the detection in optical remote sensing images (DIOR), Northwestern Polytechnical University (NWPU), and fine-grained object recognition in high-resolution remote sensing imagery (FAIR1M) datasets demonstrate that TEMO achieves state-of-the-art performance in all settings.