Making Archives Searchable: Vision-Language Models for Classification of Historical Aerial Imagery

Marvin Burges, Sebastian Zambanini, Robert Sablatnig

Published: 01 Jan 2024, Last Modified: 05 Mar 2025, GeoSearch@SIGSPATIAL 2024, CC BY-SA 4.0

Abstract: Historical aerial imagery archives contain valuable geospatial data for studying urban development, environmental changes, and historical events. However, the volume of data and inconsistencies in metadata and georeferencing complicate content classification. This paper explores the application of vision-language models, such as RemoteCLIP, GRAFT, GPT-4o mini, GeoCHAT, and RS-LLaVA, to automate classification and search within historical aerial imagery archives. We introduce a novel classification dataset with multiple annotated classes, including bridges, buildings, train tracks, bomb craters, and roads. Our exploratory analysis of model performance on this dataset provides initial insights into their capabilities and limitations in the historical imagery context. Automating image tagging with basic descriptors lays the groundwork for more sophisticated searches, enhancing access to cultural heritage. Future work should focus on optimizing feature map storage and scalability.
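
As a rough illustration of how tagging with basic descriptors could work for CLIP-style models such as RemoteCLIP, the sketch below assigns class tags to an image by comparing its embedding with the embeddings of text prompts. It assumes the embeddings have already been produced by some image and text encoder; the toy 3-dimensional vectors, the `tag_image` helper, and the 0.5 threshold are hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def tag_image(image_emb, class_embs, threshold=0.5):
    """Return the class names whose prompt embedding is similar enough
    to the image embedding, ordered from most to least similar.

    The threshold is an illustrative assumption; in practice it would
    have to be tuned on annotated data.
    """
    scores = {name: cosine(image_emb, emb) for name, emb in class_embs.items()}
    kept = [name for name, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda name: -scores[name])


# Toy stand-ins for text-prompt embeddings (hypothetical values).
class_embs = {
    "bridge": [1.0, 0.0, 0.0],
    "bomb crater": [0.0, 1.0, 0.0],
    "road": [0.0, 0.0, 1.0],
}

# Toy stand-in for the embedding of one archive image (hypothetical values).
image_emb = [0.9, 0.1, 0.6]

tags = tag_image(image_emb, class_embs)
print(tags)
```

Because an aerial image can contain several classes at once, the sketch keeps every class above the threshold rather than only the single best match. Storing the resulting tags alongside each image is one simple way such an archive could then be filtered by keyword.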