From Walled Gardens to Open Streets: A Pipeline for Cross-City Data Harmonization

Published: 30 Sept 2025, Last Modified: 24 Nov 2025, Venue: urbanai Poster, License: CC BY 4.0
Keywords: open urban data, schema harmonization, large language models, retrieval-augmented generation
TL;DR: We propose OpenCityPipeline, a novel workflow that harmonizes urban data from Socrata, ArcGIS, and CKAN.
Abstract: We present $\textit{OpenCityPipeline}$, a compact, end-to-end workflow that turns fragmented municipal open data into a unified, semantically enriched resource suitable for efficient model training. Urban data is severely fragmented across disparate platforms (e.g., Socrata, ArcGIS, CKAN), hindering holistic analysis and large-scale research. Our pipeline implements platform-aware ingestion, schema harmonization, targeted cleaning, redundancy control, and an optional data-to-text layer that renders structured records directly consumable by modern retrieval and language models. We describe how the workflow curates what cities already publish into higher-value training material and an indexable evidence base. The design aligns with efforts in curated data for efficient learning by reducing integration overhead, removing redundancy, and surfacing representative, auditable samples for downstream tasks.
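As a rough illustration of three of the stages named in the abstract (schema harmonization, redundancy control, and the data-to-text layer), the following is a minimal Python sketch and not the authors' implementation. The field mappings in `FIELD_MAPS`, the target schema (`title`, `description`, `modified`), the hash-based duplicate check, and all function names are assumptions chosen for illustration; they are not taken from the paper and are not verified against the Socrata, ArcGIS, or CKAN APIs.

```python
import hashlib
import json

# Hypothetical per-platform mappings from source field names to a shared schema.
# These names are illustrative assumptions, not the pipeline's actual mappings.
FIELD_MAPS = {
    "socrata": {"name": "title", "description": "description", "updatedAt": "modified"},
    "arcgis": {"title": "title", "snippet": "description", "modified": "modified"},
    "ckan": {"title": "title", "notes": "description", "metadata_modified": "modified"},
}


def harmonize(record, platform):
    """Rename platform-specific fields to the shared schema and trim strings."""
    out = {"platform": platform}
    for src, dst in FIELD_MAPS[platform].items():
        value = record.get(src)
        out[dst] = value.strip() if isinstance(value, str) else value
    return out


def fingerprint(record):
    """Hash the case-folded title and description to detect exact duplicates."""
    key = json.dumps([
        (record.get("title") or "").lower(),
        (record.get("description") or "").lower(),
    ])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record seen for each fingerprint (simple redundancy control)."""
    seen = set()
    unique = []
    for record in records:
        fp = fingerprint(record)
        if fp not in seen:
            seen.add(fp)
            unique.append(record)
    return unique


def to_text(record):
    """Render a harmonized record as one sentence for retrieval or language models."""
    return (
        f"Dataset '{record['title']}' (source: {record['platform']}): "
        f"{record['description']}"
    )


# Usage: the same dataset published on two platforms collapses to one record.
raw = [
    ("socrata", {"name": " Street Trees ", "description": "Tree inventory"}),
    ("ckan", {"title": "street trees", "notes": "Tree inventory"}),
]
harmonized = [harmonize(rec, platform) for platform, rec in raw]
unique = deduplicate(harmonized)
sentences = [to_text(rec) for rec in unique]
```

Exact hashing only catches duplicates that match after trimming and case-folding; a real redundancy-control stage would likely need fuzzier matching, which this sketch does not attempt.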
Submission Number: 40