Enhancing VLMs for Satellite Remote Sensing Image Analysis via Contrastive Decoding

Zixuan Shangguan, Ke Xing, Jingrui Zhang, Xiaoyi Fan, Yang Zhou, Jingda Qiao

Published: 2025, Last Modified: 30 May 2026, CloudCom 2025, CC BY-SA 4.0
Abstract: Vision language models (VLMs) have opened new avenues for satellite remote sensing image analysis and have shown promise across multiple tasks. However, in the absence of a remote sensing-oriented general VLM, existing approaches rely on retraining generic VLMs with remote sensing datasets to adapt to downstream tasks. This practice is inherently affected by two factors: 1) generic VLMs are pretrained on massive web data containing noise, biases, and misinformation; and 2) many remote sensing image-text datasets use VLM-generated annotations, which can introduce hallucinations and factual errors. Such statistical biases exacerbate the alignment gap of remote sensing VLMs, leading to generation bias and degraded task performance. To address this issue, we propose Geo-Contrastive Decoding (Geo-CD) to enhance remote sensing VLMs. Geo-CD reduces over-reliance on statistical biases by contrasting the output distributions produced from distorted versus original visual inputs. This strategy ensures that the generations of VLMs remain well-grounded in the visual input, thereby improving both reliability and accuracy. Extensive experiments demonstrate that Geo-CD can be applied to diverse remote sensing tasks without additional training or external tools. On the selected base VLM, Geo-CD achieves consistent gains across most remote sensing benchmarks and reaches state-of-the-art performance.
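The contrast between output distributions from original and distorted visual inputs can be illustrated with a minimal sketch. The abstract does not give Geo-CD's exact formulation, so everything below is an assumption borrowed from generic contrastive decoding: the function name `contrastive_next_token`, the weighting `(1 + alpha) * log p_original - alpha * log p_distorted`, the plausibility cutoff `beta`, and the toy logits are all hypothetical and are not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_next_token(logits_original, logits_distorted, alpha=1.0, beta=0.1):
    """Pick the next token by contrasting two next-token distributions.

    logits_original:  logits conditioned on the original image.
    logits_distorted: logits conditioned on a distorted copy of the image.
    alpha: contrast strength (assumed hyperparameter; 0 gives plain greedy decoding).
    beta:  plausibility cutoff (assumed hyperparameter); a token is a candidate
           only if its original-image probability is at least beta times the
           highest original-image probability.
    """
    logp_orig = log_softmax(logits_original)
    logp_dist = log_softmax(logits_distorted)
    cutoff = max(logp_orig) + math.log(beta)

    best_token, best_score = None, -math.inf
    for token, (lo, ld) in enumerate(zip(logp_orig, logp_dist)):
        if lo < cutoff:
            continue  # implausible under the original image, skip
        # Tokens that stay likely even when the image is distorted are
        # driven by language priors rather than the image, so penalise them.
        score = (1 + alpha) * lo - alpha * ld
        if score > best_score:
            best_token, best_score = token, score
    return best_token


# Toy vocabulary of three tokens (made-up numbers for illustration only).
# Token 0 is favoured by priors: it scores high even with a distorted image.
# Token 1 depends on the image: its score drops sharply under distortion.
logits_original = [2.0, 1.8, -1.0]
logits_distorted = [2.5, 0.2, -1.0]

greedy_token = max(range(len(logits_original)), key=lambda i: logits_original[i])
contrastive_token = contrastive_next_token(logits_original, logits_distorted)
```

In this toy case greedy decoding on the original logits selects token 0, while the contrastive score selects token 1, the token whose likelihood actually depends on the visual input. Because the procedure only combines two forward passes at decoding time, it is consistent with the abstract's claim that no additional training or external tools are needed.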