Abstract: Fine-Grained Oriented Object Detection (FGOOD) aims to simultaneously categorize and localize fine-grained objects using oriented bounding box predictions. In this paper, we propose to exploit rich text features to discern fine-grained object categories that are subordinate to coarse-grained semantic classes, such as Boeing 747 vs. airplane. To this end, we leverage the emerging Contrastive Language-Image Pre-training (CLIP) model, whose image-text representations bridge the gap between oriented localization representations and fine-grained semantics. Our method differs from earlier FGOOD approaches, which commonly focus on region proposal refinement but overlook the inter-class relations among fine-grained categories, yielding features that are insufficiently discriminative for fine-grained recognition. Specifically, our simple yet effective language-guided fine-grained oriented object detector first integrates hierarchical information from multi-granularity labels into a rotated object detection framework, establishing a shared representation space for Region of Interest (RoI) features and text features. It then extracts fine-grained discriminative features from those RoI features using our Fine-grained Orthogonal Decomposition (FOD) and Fine-grained Orthogonal Feature Queue (FOQ). Extensive experiments validate the superiority of our approach, demonstrating substantial gains over state-of-the-art oriented object detectors on two FGOOD datasets, with mAP improvements of 1.67% on FAIR1M and 3.42% on HRSC2016.
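The abstract does not specify how FOD operates internally; as a rough intuition only, an orthogonal decomposition of an RoI feature against a coarse-class text embedding could look like the following minimal sketch (the function name and the choice of a single coarse embedding are hypothetical, not the paper's actual formulation):

```python
import numpy as np

def orthogonal_decompose(roi_feat: np.ndarray, coarse_text_emb: np.ndarray):
    """Hypothetical sketch: split an RoI feature into the component along a
    coarse-grained text embedding (shared, category-level information) and
    the orthogonal residual (candidate fine-grained discriminative part)."""
    t = coarse_text_emb / np.linalg.norm(coarse_text_emb)  # unit direction
    parallel = np.dot(roi_feat, t) * t                     # coarse component
    orthogonal = roi_feat - parallel                       # residual component
    return parallel, orthogonal

# Toy example: a 4-d feature against a toy "airplane" embedding direction.
f = np.array([1.0, 2.0, 3.0, 4.0])
t = np.array([1.0, 0.0, 0.0, 0.0])
par, orth = orthogonal_decompose(f, t)
assert np.isclose(np.dot(orth, t), 0.0)   # residual is orthogonal to t
assert np.allclose(par + orth, f)         # decomposition reconstructs f
```

The residual component would carry the variation that distinguishes, e.g., a Boeing 747 from other airplanes, which is consistent with the abstract's goal of extracting fine-grained discriminative features.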