Open-Vocabulary Object Detection for Low-Altitude Scenarios Using RGB-Infrared Data: A Benchmark and A New Method
Keywords: Open-Vocabulary Object Detection, Low-Altitude Scenarios, Aligned RGB-Infrared Data
TL;DR: Open-Vocabulary Object Detection for Low-Altitude Scenarios Using Aligned RGB-Infrared Data
Abstract: Traditional object detection methods are limited by closed datasets, while open-vocabulary object detection (OVOD) overcomes this limitation.
However, most existing OVOD approaches are trained on natural-scene images and struggle to generalize to low-altitude scene images (e.g., UAV-captured images) due to domain differences between the datasets.
Therefore, this paper aims to advance research on open-vocabulary object detection in low-altitude scenarios.
Unlike most existing open-vocabulary methods, which are trained solely on RGB images and corresponding textual annotations, this paper proposes the first low-altitude open-vocabulary object detection dataset using aligned RGB-Infrared images, named RGB-Infrared OVOD Dataset (RIOVOD), equipped with vocabulary-level annotations.
We aim to leverage the complementary information between the two modalities, specifically by utilizing the texture features of RGB images and the thermal radiation information from infrared images, to further enhance the performance of OVOD methods.
Building on this dataset, we propose a new architecture for open-vocabulary object detection in low-altitude scenarios using RGB-Infrared data, named LSRI.
Extensive experiments show that our method outperforms other approaches, achieving 0.356 $AP_{50}$ on the RGB-Infrared OVOD Dataset, compared to 0.246 and 0.302 $AP_{50}$ for single-modal RGB and Infrared methods, respectively.
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 8229