Advancing Open-Set Object Detection in Remote Sensing Using Multimodal Large Language Model

Published: 01 Jan 2025 · Last Modified: 08 Jul 2025 · WACV (Workshops) 2025 · CC BY-SA 4.0
Abstract: In recent years, open-set recognition in remote sensing has attracted significant attention. The goal is to identify unknown objects during inference, extending models trained on labeled data for known objects so that they generalize beyond their training categories. However, obtaining bounding-box annotations for unknown object categories at scale is prohibitively expensive. Multimodal large language models (MLLMs) offer a promising alternative, enabling the discovery of unknown object categories without human intervention in labeling novel classes. In this paper, we propose a novel methodology that leverages MLLMs to address the dual challenges of detecting and categorizing unknown objects in remote sensing imagery. By integrating three diverse datasets (DOTA, DIOR, and NWPU VHR-10) and partitioning their object classes into known and unknown categories, we simulate real-world open-set conditions. The proposed methodology follows a two-step approach: (1) open-set object region detection, where known objects are identified using a model trained on labeled data, while threshold-based region proposal extraction is applied to detect unknown objects; and (2) discovery and semantic labeling of unknown objects using MLLM-based textual annotation. The contextual descriptions generated by the MLLM serve as human-interpretable pseudo-labels, which are further validated using vision-language similarity metrics. Experimental results demonstrate significant improvements in both detection (achieving high recall for unknown objects) and discovery (producing meaningful and accurate categorizations of novel objects). This work highlights the transformative potential of MLLMs for interpreting unknowns and paves the way for more robust open-set object detection in the remote sensing domain.
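The pseudo-label validation step mentioned in the abstract, scoring an MLLM-generated label against an image region with a vision-language similarity metric, can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the cosine-similarity metric, the acceptance threshold of 0.25, and the toy hand-made embeddings are all assumptions; a real system would obtain the embeddings from a pretrained vision-language encoder such as CLIP.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def validate_pseudo_labels(region_emb, labels, label_embs, threshold=0.25):
    """Keep only the MLLM pseudo-labels whose text embedding is
    sufficiently similar to the region's image embedding.

    threshold is a hypothetical cutoff, not a value from the paper.
    """
    return [
        label
        for label, emb in zip(labels, label_embs)
        if cosine_similarity(region_emb, emb) >= threshold
    ]

# Toy 3-d "embeddings" standing in for real vision-language features.
region = np.array([1.0, 0.0, 0.0])          # embedding of an unknown region
labels = ["storage tank", "harbor"]          # candidate MLLM pseudo-labels
label_embs = [
    np.array([0.9, 0.1, 0.0]),               # well aligned with the region
    np.array([0.0, 1.0, 0.0]),               # orthogonal: should be rejected
]
print(validate_pseudo_labels(region, labels, label_embs))  # ['storage tank']
```

In practice the same filter would run over every unknown-object region proposal, discarding MLLM descriptions that the vision-language model does not corroborate.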