Keywords: Multi-modal benchmarks, Vision-language models, Aerial imagery, Urban change detection, Temporal localization
TL;DR: We introduce StreetTransformer, a city-scale benchmark fusing imagery, features, and civic documents to evaluate LLMs on detecting and describing urban streetscape change
Abstract: As the nodes of transportation networks where modal and directional flows converge, intersections are inherently complex compositions of myriad infrastructure design choices made over time. Yet benchmarks that test machine or LLM understanding of street-level infrastructure and its longitudinal changes are scarce. We present StreetTransformer, a large-scale multimodal corpus and benchmark of New York City streetscape change from 2006 to 2024. It links orthorectified aerial imagery, time-aware planimetric features, capital project records, and thousands of annotated civic design documents, covering about 47,000 intersections, 470,000 aerial snapshots, and more than 33,000 document pages with cross-modal links to project records. As a benchmark, we define five tasks for LLMs spanning change and feature detection, document-to-image linking, and temporal localization, with optional semantic segmentation masks. We find that GPT baselines detect change but struggle with temporal localization and with reasoning across sources, which motivates task-specific training and ontology-based evaluation.
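To make the cross-modal linkage described above concrete, here is a minimal sketch of how one per-intersection record could be laid out. The field names and structure are illustrative assumptions, not the released StreetTransformer schema.

```python
# Hypothetical per-intersection record tying together the modalities the
# abstract describes: aerial snapshots, planimetric features, capital
# project records, and civic design document pages. All names are
# assumed for illustration.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AerialSnapshot:
    year: int                        # capture year within the 2006-2024 span
    image_path: str                  # orthorectified aerial tile for this intersection
    mask_path: Optional[str] = None  # optional semantic segmentation mask


@dataclass
class IntersectionRecord:
    intersection_id: str  # one of ~47,000 NYC intersections
    snapshots: list[AerialSnapshot] = field(default_factory=list)
    planimetric_features: list[dict] = field(default_factory=list)  # time-aware vector features
    project_ids: list[str] = field(default_factory=list)            # linked capital project records
    document_page_ids: list[str] = field(default_factory=list)      # annotated civic document pages
```

Under this layout, a temporal-localization task would present a model with an `IntersectionRecord`'s snapshot sequence and ask when a given feature change occurred, while document-to-image linking would ask which `document_page_ids` correspond to a visible change.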
Submission Number: 60