Can Large Vision-Language Models Refuse Synthetic Images in Geo-localization?

11 Sept 2025 (modified: 25 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Geo-localization, Safety, Vision-Language Models
Abstract: Large Vision-Language Models (VLMs) have recently achieved remarkable progress in image-based geo-localization, yet face a critical safety vulnerability: they confidently predict locations for synthetic or manipulated images with no real-world correspondence. Such “refusal failures” threaten navigation, emergency response, and geographic information integrity. To address this gap, we introduce GeoSafety-Bench, the first benchmark specifically designed to evaluate geo-safety awareness in localization systems. It contains 5,997 images spanning authentic photos and four synthesis paradigms—3D rendering, text-to-image generation, image-to-image modification, and instruction-guided viewpoint synthesis. We define two key evaluation metrics, refusal failure and over-safety, to quantify the trade-off between utility and safety. Extensive experiments across retrieval-based methods, domain-specific models, and state-of-the-art VLMs reveal that while models achieve strong accuracy on authentic images, they almost universally fail to reject synthetic ones, particularly under instruction-guided generation. We also provide an illustrative baseline to show that safety-aware training can improve refusal robustness. GeoSafety-Bench thus provides a rigorous foundation for developing and evaluating trustworthy geo-localization models.
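Illustrative note (not from the paper): the two metrics named in the abstract can be read as complementary error rates, where refusal failure is answering on a synthetic image and over-safety is refusing on an authentic one. The following minimal Python sketch computes both rates under that assumed reading; the field names and exact definitions are hypothetical and may differ from GeoSafety-Bench.

def safety_rates(records):
    """Compute assumed refusal-failure and over-safety rates.

    records: iterable of dicts with keys
      - "is_synthetic": True if the image has no real-world correspondence
      - "refused":      True if the model declined to predict a location
    """
    synthetic = [r for r in records if r["is_synthetic"]]
    authentic = [r for r in records if not r["is_synthetic"]]

    # Refusal failure: the model answers (does not refuse) on a synthetic image.
    refusal_failure = sum(not r["refused"] for r in synthetic) / max(len(synthetic), 1)

    # Over-safety: the model refuses on an authentic image it should localize.
    over_safety = sum(r["refused"] for r in authentic) / max(len(authentic), 1)

    return refusal_failure, over_safety

if __name__ == "__main__":
    demo = [
        {"is_synthetic": True,  "refused": False},  # counts as a refusal failure
        {"is_synthetic": True,  "refused": True},
        {"is_synthetic": False, "refused": False},
        {"is_synthetic": False, "refused": True},   # counts as over-safety
    ]
    rf, ov = safety_rates(demo)
    print(f"refusal failure: {rf:.2f}, over-safety: {ov:.2f}")

A model that always predicts a location scores 1.0 refusal failure and 0.0 over-safety under this reading, which is the trade-off the benchmark is designed to expose.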
Supplementary Material: pdf
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 4175