Bridging the Modality Gap: Training-free Adaptation of Vision-Language Models for Remote Sensing via Visual Prototypes

Published: 12 Jun 2025 · Last Modified: 18 Aug 2025 · Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025 · License: CC BY 4.0
Abstract: Large-scale Vision-Language Models (VLMs) have demonstrated remarkable few-shot learning capabilities across various visual tasks. However, effectively adapting these models to remote sensing, a domain characterized by specialized object appearances and scarce labeled data, remains non-trivial. In this work, we present a training-free adaptation strategy that employs region-level visual prototypes for object detection in remote sensing imagery. Instead of relying on textual prompts, we directly derive representative embeddings from a small number of annotated bounding boxes, capturing domain-specific characteristics that generic language encoders may overlook. To compensate for the resulting modality gap between region-region and region-text similarities, we introduce an affine normalization step that re-calibrates prototype-based scores without any model fine-tuning. We evaluate our method on the DIOR and NWPU-VHR10 benchmarks, demonstrating consistent and substantial improvements over previous training-free approaches. Moreover, we offer an in-depth analysis of different prototype construction and aggregation strategies, revealing how carefully chosen protocols can further strengthen few-shot detection in remote sensing.
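To make the two steps described in the abstract concrete, the following is a minimal sketch of how class prototypes could be built from a handful of annotated boxes and how an affine normalization could re-calibrate prototype similarities, assuming CLIP-style L2-normalized region embeddings. The function names, the reference statistics `ref_mean`/`ref_std`, and all synthetic data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def build_prototypes(region_embeddings, labels, num_classes):
    # Average the L2-normalized embeddings of the few annotated boxes per class,
    # then re-normalize; assumes at least one annotated box per class.
    protos = np.zeros((num_classes, region_embeddings.shape[1]))
    for c in range(num_classes):
        cls_embs = region_embeddings[labels == c]
        proto = cls_embs.mean(axis=0)
        protos[c] = proto / np.linalg.norm(proto)
    return protos

def affine_normalize(scores, ref_mean, ref_std, eps=1e-8):
    # Shift and scale region-region similarity scores so their distribution
    # matches reference (e.g. region-text) statistics; no model weights change.
    return (scores - scores.mean()) / (scores.std() + eps) * ref_std + ref_mean

if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Stand-in for embeddings of 4 annotated boxes per class over 5 classes.
    shots = rng.normal(size=(20, 512))
    shots /= np.linalg.norm(shots, axis=1, keepdims=True)
    labels = np.repeat(np.arange(5), 4)
    protos = build_prototypes(shots, labels, num_classes=5)

    # Score candidate proposals by cosine similarity to each prototype,
    # then re-calibrate with assumed reference statistics.
    proposals = rng.normal(size=(100, 512))
    proposals /= np.linalg.norm(proposals, axis=1, keepdims=True)
    raw = proposals @ protos.T                      # shape (100, 5)
    calibrated = affine_normalize(raw, ref_mean=0.25, ref_std=0.05)
    print(calibrated.shape, calibrated.mean(), calibrated.std())
```

In this sketch the calibration is a simple distribution-matching affine map applied to the score matrix as a whole; the paper's actual normalization protocol and the choice of reference statistics may differ.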