Abstract: Detecting small, arbitrarily oriented objects in complex remote sensing images remains a significant challenge in computer vision. Conventional CNN-based detectors struggle to capture the fine-grained structures of small, arbitrarily oriented objects. Moreover, existing single-stage methods rarely exploit cross-modal cues, leaving a semantic gap between category priors and visual features. To address these issues, we propose a novel detection framework incorporating a FocusConv module and a CLIP-guided head. The FocusConv module dynamically adjusts sampling points based on region-of-interest (RoI) classification scores to enhance feature extraction in target areas, improving small-object representation. The CLIP-guided head uses text-encoded category embeddings to align semantic information with image features through pixel-text matching, effectively guiding the detection head. Experimental results on the DOTA-v1.0 and DOTA-v1.5 benchmarks demonstrate that our method outperforms existing single-stage detectors, achieving state-of-the-art performance under single-scale conditions.
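The abstract describes two mechanisms: score-guided adjustment of sampling points and pixel-text matching against CLIP text embeddings. The sketch below is a minimal illustration of these two ideas, not the authors' released implementation; it assumes a PyTorch/torchvision environment, and the module names (ScoreGuidedConv, PixelTextHead) and parameters (e.g. text_feats) are hypothetical.

```python
# Minimal sketch of (1) sampling offsets modulated by a per-pixel score map and
# (2) a pixel-text matching head against CLIP text embeddings. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import deform_conv2d


class ScoreGuidedConv(nn.Module):
    """Deformable-style conv whose sampling offsets are scaled by a score map."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k = k
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.offset_pred = nn.Conv2d(in_ch, 2 * k * k, 3, padding=1)

    def forward(self, x, score_map):
        # score_map: (N, 1, H, W) in [0, 1]; higher scores permit larger
        # offsets, concentrating sampling on likely object regions.
        offsets = self.offset_pred(x) * score_map
        return deform_conv2d(x, offsets, self.weight, padding=self.k // 2)


class PixelTextHead(nn.Module):
    """Scores each pixel by cosine similarity to text-encoded category embeddings."""

    def __init__(self, feat_dim, embed_dim):
        super().__init__()
        self.proj = nn.Conv2d(feat_dim, embed_dim, 1)
        self.logit_scale = nn.Parameter(torch.tensor(10.0))

    def forward(self, feats, text_feats):
        # feats: (N, C, H, W); text_feats: (K, D), e.g. from a frozen CLIP text encoder.
        pix = F.normalize(self.proj(feats), dim=1)        # (N, D, H, W)
        txt = F.normalize(text_feats, dim=-1)             # (K, D)
        logits = torch.einsum("ndhw,kd->nkhw", pix, txt)  # pixel-text matching
        return self.logit_scale * logits                  # (N, K, H, W)


if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 32)
    scores = torch.sigmoid(torch.randn(2, 1, 32, 32))      # stand-in for RoI scores
    feats = ScoreGuidedConv(64, 64)(x, scores)
    text_feats = torch.randn(15, 512)                      # e.g. 15 DOTA categories
    logits = PixelTextHead(64, 512)(feats, text_feats)
    print(feats.shape, logits.shape)                       # (2, 64, 32, 32), (2, 15, 32, 32)
```

In this reading, the score map gates how far the convolution may deform its sampling grid, while per-pixel image embeddings are matched to category text embeddings by cosine similarity to guide the detection head; the actual paper may realize both steps differently.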
External IDs: dblp:conf/icic/WangYTXCLXW25