Measuring and Mitigating Hallucinations in Vision-Language Dataset Generation for Remote Sensing

Published: 13 Dec 2024, Last Modified: 19 Feb 2025
Venue: Good-Data
License: CC BY 4.0
Student Lead Author Indication: Yes
Keywords: remote sensing, vision-language datasets, data quality assessment, multimodality, CLIP, foundation models
Abstract: Vision-language models have achieved impressive results across many fields. However, their adoption in remote sensing remains limited, largely due to the scarcity of paired image-text data. To bridge this gap, synthetic caption generation has gained interest, traditionally relying on rule-based methods that use metadata or bounding boxes. While these approaches provide some description, they often lack the depth needed to capture complex wide-area scenes. Large language models (LLMs) offer a promising alternative for generating more descriptive captions, yet they can produce generic outputs and are prone to hallucination. In this paper, we propose a new method to enhance vision-language datasets for remote sensing by integrating maps as an external data source, enabling the generation of detailed, context-rich captions. Additionally, we present methods to measure and mitigate hallucinations in LLM-generated text. We introduce fMoW-mm, a multimodal dataset incorporating satellite imagery, maps, metadata, and text annotations. We demonstrate its effectiveness for automatic target recognition in few-shot settings, achieving superior performance compared to other vision-language remote sensing datasets.
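The abstract does not specify how hallucinations in LLM-generated captions are quantified. As a rough, hypothetical illustration of the general idea of grounding caption content in map data, the sketch below flags caption mentions that have no supporting object in a map-derived object list and reports their fraction. All function names, the object vocabulary, and the simple string-matching strategy are assumptions made for illustration, not the paper's actual procedure.

```python
# Illustrative sketch only -- not the paper's published method.
# Compares objects mentioned in an LLM-generated caption against objects
# extracted from a map layer and reports the unsupported (hallucinated) fraction.

from typing import Iterable, Set


def extract_mentions(caption: str, vocabulary: Set[str]) -> Set[str]:
    """Return vocabulary terms (e.g. 'runway', 'stadium') that appear in the caption."""
    text = caption.lower()
    return {term for term in vocabulary if term in text}


def hallucination_rate(caption: str,
                       map_objects: Iterable[str],
                       vocabulary: Set[str]) -> float:
    """Fraction of caption mentions not supported by map-derived objects."""
    mentions = extract_mentions(caption, vocabulary)
    if not mentions:
        return 0.0
    supported = {obj.lower() for obj in map_objects}
    return len(mentions - supported) / len(mentions)


if __name__ == "__main__":
    vocab = {"runway", "stadium", "harbor", "parking lot"}
    caption = "A large stadium next to a parking lot and a harbor."
    map_objects = ["stadium", "parking lot"]  # objects confirmed by the map layer
    print(f"hallucination rate: {hallucination_rate(caption, map_objects, vocab):.2f}")
    # -> 0.33: 'harbor' is mentioned in the caption but has no map support
```

In practice, a metric of this kind could be swapped in with a more robust matcher (e.g. embedding similarity instead of substring matching); the point is only to show how map data can serve as a reference for checking caption faithfulness.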
Submission Number: 33