Measuring and Mitigating Hallucinations in Vision-Language Dataset Generation for Remote Sensing

Published: 13 Dec 2024, Last Modified: 19 Feb 2025
Venue: Good-Data
License: CC BY 4.0
Student Lead Author Indication: Yes
Keywords: remote sensing, vision-language datasets, data quality assessment, multimodality, CLIP, foundation models
Abstract: Vision-language models have achieved impressive results across many fields. However, their adoption in remote sensing remains limited, largely due to the scarcity of paired image-text data. To bridge this gap, synthetic caption generation has gained interest, traditionally relying on rule-based methods that use metadata or bounding boxes. While these approaches provide some description, they often lack the depth needed to capture complex wide-area scenes. Large language models (LLMs) offer a promising alternative for generating more descriptive captions, yet they can produce generic outputs and are prone to hallucination. In this paper, we propose a new method to enhance vision-language datasets for remote sensing by integrating maps as an external data source, enabling the generation of detailed, context-rich captions. Additionally, we present methods to measure and mitigate hallucinations in LLM-generated text. We introduce fMoW-mm, a multimodal dataset incorporating satellite imagery, maps, metadata, and text annotations. We demonstrate its effectiveness for automatic target recognition in few-shot settings, achieving superior performance compared to other vision-language remote sensing datasets.
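The abstract does not specify how hallucinations in LLM-generated captions are quantified. As a rough, hypothetical illustration of the general idea of grounding caption content in map data, the sketch below flags caption mentions that have no supporting object in a map-derived object list and reports their fraction. All function names, the object vocabulary, and the simple string-matching strategy are assumptions made for illustration, not the paper's actual procedure.

```python
# Illustrative sketch only -- not the paper's published method.
# Compares objects mentioned in an LLM-generated caption against objects
# extracted from a map layer and reports the unsupported (hallucinated) fraction.

from typing import Iterable, Set


def extract_mentions(caption: str, vocabulary: Set[str]) -> Set[str]:
    """Return vocabulary terms (e.g. 'runway', 'stadium') that appear in the caption."""
    text = caption.lower()
    return {term for term in vocabulary if term in text}


def hallucination_rate(caption: str,
                       map_objects: Iterable[str],
                       vocabulary: Set[str]) -> float:
    """Fraction of caption mentions not supported by map-derived objects."""
    mentions = extract_mentions(caption, vocabulary)
    if not mentions:
        return 0.0
    supported = {obj.lower() for obj in map_objects}
    return len(mentions - supported) / len(mentions)


if __name__ == "__main__":
    vocab = {"runway", "stadium", "harbor", "parking lot"}
    caption = "A large stadium next to a parking lot and a harbor."
    map_objects = ["stadium", "parking lot"]  # objects confirmed by the map layer
    print(f"hallucination rate: {hallucination_rate(caption, map_objects, vocab):.2f}")
    # -> 0.33: 'harbor' is mentioned in the caption but has no map support
```

In practice, a metric of this kind could be swapped in with a more robust matcher (e.g. embedding similarity instead of substring matching); the point is only to show how map data can serve as a reference for checking caption faithfulness.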
Submission Number: 33