OS-W2S: An Automatic Labeling Engine for Language-Guided Open-Set Aerial Object Detection

ICLR 2026 Conference Submission 14647 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Open-Set Aerial Object Detection, Automatic Label Engine, Multi-instance Open-set Aerial Dataset
Abstract: In recent years, language-guided open-set aerial object detection has attracted significant attention because it aligns closely with real-world application needs. However, owing to the scarcity of suitable datasets, most existing language-guided methods operate only at the vocabulary level, which falls short of the demands of fine-grained open-world detection. To address this limitation, we construct a large-scale language-guided open-set aerial detection dataset spanning three levels of language guidance: from words to phrases and, ultimately, to sentences. Built around an open-source large vision-language model, with image-operation-based preprocessing and BERT-based postprocessing, we present the $\textbf{OS-W2S Label Engine}$, an automatic annotation pipeline capable of handling diverse scene annotations for aerial images. Using this label engine, we expand existing aerial detection datasets with rich textual annotations and construct a novel benchmark, the Multi-instance Open-set Aerial Dataset ($\textbf{MI-OAD}$), which addresses the limitations of current remote sensing grounding data and enables effective language-guided open-set aerial detection. Specifically, MI-OAD contains 163,023 images and 2 million image-caption pairs, with multiple instances per caption, making it approximately 40 times larger than comparable datasets. To demonstrate the effectiveness and quality of MI-OAD, we evaluate three representative tasks: language-guided open-set aerial detection, open-vocabulary aerial detection (OVAD), and remote sensing visual grounding (RSVG). On language-guided open-set aerial detection, training on MI-OAD lifts Grounding DINO by +31.1 AP$_{50}$ and +34.7 Recall@10 with sentence-level inputs under zero-shot transfer.
Moreover, pre-training on MI-OAD yields state-of-the-art performance on multiple existing OVAD and RSVG benchmarks, validating both the effectiveness of the dataset and the high quality of its OS-W2S annotations. More details are available at \url{https://anonymous.4open.science/r/MI-OAD}.
Primary Area: datasets and benchmarks
Submission Number: 14647