ViGen500K: A Sustainable-Expansion Image-Text Aligned Dataset For Remote Sensing

Boyuan Tong, Runyan Du, Wenkai Zhang, Jihao Li, Shuoke Li, Chongyang Li, Zhi Guo, Xian Sun, Guangluan Xu

Published: 2024, Last Modified: 14 Nov 2024, IGARSS 2024, CC BY-SA 4.0

Abstract: Recently, large-scale Vision-Language Models (VLMs) have gained wide attention in the field of remote sensing. However, research on VLMs requires a substantial amount of data, which is relatively scarce in the remote sensing domain. To overcome this limitation, we present ViGen500K, a larger and more challenging image-text dataset. It contains nearly 500,000 images accompanied by over 1 million annotations, to meet the diverse requirements of various image-text tasks in remote sensing. In addition, we propose a promising, efficient, low-cost, and highly automated data annotation method that allows the dataset to be extended easily by continually adding unlabeled remote sensing images. In theory, ViGen500K can therefore be expanded without limit. Quantitatively, compared with traditional image caption datasets, ViGen500K not only has more images but also covers more object categories, which gives models trained on it a wider range of target-text alignment capabilities. Several experiments have been conducted to provide benchmarks for the dataset.