LLaMA-Unidetector: An LLaMA-Based Universal Framework for Open-Vocabulary Object Detection in Remote Sensing Imagery
Abstract: Object detection is a crucial task in computer vision for remote sensing applications. However, the reliance of traditional methods on predefined, trained object categories limits their applicability in open-world scenarios. A key challenge in open-vocabulary object detection lies in accurately identifying unseen objects. Existing approaches often focus solely on detecting object locations and struggle to recognize the categories of previously unseen targets. To address this issue, we propose a novel benchmark in which models are trained on known base classes and evaluated on their ability to detect and recognize unseen or novel classes. To this end, we introduce LLaMA-Unidetector, a universal framework that incorporates textual information into a closed-set detector, enabling generalization to open-set scenarios. LLaMA-Unidetector leverages a decoupled learning strategy that separates localization from recognition. In the first stage, a class-agnostic detector localizes objects, distinguishing only between foreground and background. In the second stage, the detected foreground objects are passed to TerraOV-LLM, a multimodal large language model (MLLM), which exploits the strong generalization capabilities of large language models to infer the correct categories. We build a remote sensing visual question answering (VQA) dataset, TerraVQA, and conduct extensive experiments on the NWPU-VHR10, DOTA1.0, and DIOR datasets. LLaMA-Unidetector achieves 75.46% AP, 50.22% AP, and 51.38% AP on the zero-shot detection benchmarks of NWPU-VHR10, DOTA1.0, and DIOR, respectively. Our source code is available at: https://github.com/ChloeeGrace/LLaMA-Unidetector
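The following is a minimal sketch of the decoupled two-stage inference described in the abstract: a class-agnostic detector proposes foreground boxes, and an MLLM is then queried to name each box's category. All names here (open_vocab_detect, detect_fg, recognize, objectness_thr) are illustrative assumptions, not the authors' actual API.

```python
# Hypothetical sketch of the decoupled localization/recognition pipeline.
# Stage 1: class-agnostic detection (foreground vs. background only).
# Stage 2: per-box category inference via a multimodal LLM (e.g. a VQA-style
# prompt such as "What object is in this region?").
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Detection:
    box: Box
    objectness: float              # stage-1 foreground score (class-agnostic)
    category: Optional[str] = None  # filled in by the stage-2 MLLM


def open_vocab_detect(
    image,
    detect_fg: Callable[[object], List[Tuple[Box, float]]],
    recognize: Callable[[object, Box], str],
    objectness_thr: float = 0.5,
) -> List[Detection]:
    """Run class-agnostic localization, then MLLM-based recognition."""
    detections: List[Detection] = []
    for box, score in detect_fg(image):
        if score < objectness_thr:
            continue  # low-objectness proposals are treated as background
        label = recognize(image, box)  # MLLM names the cropped region
        detections.append(Detection(box=box, objectness=score, category=label))
    return detections
```

Because the two stages communicate only through boxes and objectness scores, the detector never needs to know the category vocabulary; any category the MLLM can name is, in principle, detectable.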
External IDs: dblp:journals/tgrs/XieWZSCZL25