Keywords: vision-language models, CLIP, hallucination, calibration, remote sensing, Nordic climate
Abstract: Automated satellite image interpretation is becoming increasingly important for climate monitoring. However, the reliability of zero-shot vision-language models (VLMs) remains under-studied, especially in boreal and subarctic environments. We evaluate CLIP, RemoteCLIP, and BLIP-ITM on three diagnostic datasets, covering clean baseline conditions, Nordic seasonal shift, and geographic out-of-distribution (OOD) shift (n\,=\,500--2000), and measure the hallucination rate, the calibration error, and the reliability of a training-free trust proxy (VTCS). Hallucination rates reach 49--89\%, and under Nordic seasonal shift confidence scores become \emph{anti-correlated with correctness in summer} (AUROC\,0.48, below chance). The same signal is more reliable in winter (AUROC\,0.70). In contrast, VTCS is more stable across seasons (0.59 in summer, 0.57 in winter) and provides better overall discrimination, but it degrades under geographic shift. A cross-model comparison confirms that the failure is consistent across all tested architectures. Overall, these findings imply that per-image confidence thresholds are unreliable for operational Nordic deployment and that season-specific recalibration is necessary: in Nordic summer conditions, filtering predictions by confidence systematically \emph{increases} error.
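To make the link between a below-chance AUROC and confidence filtering concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes AUROC is computed with per-image correctness as the label and the confidence score as the ranking signal. The function names `auroc` and `error_rate_after_filter` and the toy values are hypothetical and are not taken from the paper.

```python
def auroc(scores, correct):
    """Probability that a randomly chosen correct prediction has a higher
    score than a randomly chosen incorrect one (ties count as 0.5)."""
    pos = [s for s, c in zip(scores, correct) if c]
    neg = [s for s, c in zip(scores, correct) if not c]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect prediction")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def error_rate_after_filter(scores, correct, threshold):
    """Error rate among predictions whose score is at least `threshold`.
    Returns None if no prediction survives the filter."""
    kept = [c for s, c in zip(scores, correct) if s >= threshold]
    if not kept:
        return None
    return 1.0 - sum(kept) / len(kept)


# Toy, made-up values for illustration only (NOT results from the paper).
# Here the incorrect predictions tend to carry the highest confidence.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
correct = [0, 0, 1, 0, 1, 0, 1, 1]

print("AUROC:", auroc(scores, correct))
print("error, no filter:", error_rate_after_filter(scores, correct, 0.0))
print("error, confidence >= 0.5:", error_rate_after_filter(scores, correct, 0.5))
```

In this toy case the AUROC is below 0.5, and keeping only the high-confidence predictions raises the error rate relative to keeping everything, which is the qualitative behavior the abstract describes for Nordic summer conditions.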
Submission Number: 5