From Word to Sentence: A Large-Scale Multi-Instance Dataset for Open-Set Aerial Detection

Guoting Wei; Yu Liu; Xia Yuan; XIZHE XUE; Linlin Guo; Yifan Yang; Chunxia Zhao; Zongwen Bai; Haokui Zhang; Rong Xiao

10 May 2025 (modified: 30 Oct 2025) | Submitted to NeurIPS 2025 Datasets and Benchmarks Track | CC BY 4.0

Keywords: Open-Set Aerial Object Detection, Automatic Label Engine, Multi-instance Open-set Aerial Dataset

Abstract: In recent years, language-guided open-world aerial object detection has gained significant attention because it aligns better with real-world application needs. However, because datasets are limited, most existing language-guided methods focus primarily on vocabulary-level guidance, which fails to meet the demands of more fine-grained open-world detection. To address this limitation, we construct a large-scale language-guided open-set aerial detection dataset that encompasses three levels of language guidance: words, phrases, and sentences. We present the $\textbf{OS-W2S Label Engine}$, an automatic annotation pipeline centered on an open-source large vision-language model that integrates image-operation-based preprocessing with BERT-based postprocessing and can handle diverse scene annotations for aerial images. Using this label engine, we expand existing aerial detection datasets with rich textual annotations and construct a novel benchmark dataset, the Multi-instance Open-set Aerial Dataset $(\textbf{MI-OAD})$, which addresses the limitations of current remote sensing grounding data and enables effective open-set aerial detection. Specifically, MI-OAD contains 163,023 images and 2 million image-caption pairs, with multiple instances per caption, making it approximately 40 times larger than comparable datasets. We also train state-of-the-art open-set methods from the natural image domain on our proposed dataset to validate the resulting models’ open-set detection capabilities. For instance, when trained on our dataset, Grounding DINO achieves improvements of 31.1 $AP_{50}$ and 34.7 Recall@10 for sentence inputs under zero-shot transfer conditions. Both the dataset and the Label Engine will be made publicly available.

Croissant File: json

Dataset URL: https://kaggle.com/datasets/070cdff2f649a10895c6fa09a45a58d00982afd8a8ba573696f521edd59cc028

Code URL: https://anonymous.4open.science/r/MI-OAD

Supplementary Material: pdf

Primary Area: Datasets & Benchmarks for applications in computer vision

Submission Number: 1101
