Towards Zero-Shot Annotation of the Built Environment with Vision-Language Models

Published: 01 Jan 2024 | Last Modified: 19 Dec 2024 | SIGSPATIAL/GIS 2024 | License: CC BY-SA 4.0
Abstract: Equitable urban transportation applications require high-fidelity digital representations of the built environment (streets, crossings, curb ramps, and more). Direct inspections and manual annotations are costly at scale, while conventional machine learning methods require substantial annotated training data to perform adequately. This study explores vision-language models as a tool for annotating diverse urban features from satellite images, reducing the dependence on human annotation. Although these models excel at describing common objects in human-centric images, their training sets may lack signals for esoteric built environment features, making their performance uncertain. We demonstrate a proof-of-concept using a vision-language model and a visual prompting strategy that considers segmented image elements. Experiments on two urban features, stop lines and raised tables, show that while zero-shot prompting rarely works, the segmentation and visual prompting strategies achieve nearly 40% intersection-over-union. We describe how these results motivate further research in automatic annotation of the built environment to improve equity, accessibility, and safety at scale and in diverse environments.
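
The segmentation-plus-visual-prompting strategy and the intersection-over-union evaluation mentioned in the abstract can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `query_vlm` is a hypothetical placeholder for whichever vision-language model is used, and the segment masks are assumed to come from an off-the-shelf segmenter; only the IoU metric is standard.

```python
import numpy as np


def query_vlm(image_crop: np.ndarray, feature: str) -> bool:
    """Hypothetical stand-in for a vision-language model query that answers
    whether the cropped region shows the target feature (e.g., a stop line)."""
    raise NotImplementedError("Wire up a VLM of choice here.")


def predict_feature_mask(image: np.ndarray,
                         segments: list[np.ndarray],
                         feature: str) -> np.ndarray:
    """Union the segments that the VLM judges to contain the target feature."""
    prediction = np.zeros(image.shape[:2], dtype=bool)
    for mask in segments:
        ys, xs = np.where(mask)
        if ys.size == 0:
            continue
        # Crop the segment's bounding box and ask the model about it.
        crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        if query_vlm(crop, feature):
            prediction |= mask
    return prediction


def iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection-over-union between two binary masks."""
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return np.logical_and(pred, truth).sum() / union
```

Querying each segment separately keeps the prompt focused on a single candidate region, which appears to be the intuition behind visual prompting over segmented image elements rather than prompting on the whole satellite tile.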