OpenCityCorpus: A Large-Scale, Harmonized, and LLM-Ready Corpus of Urban Data for Scientific Research

Published: 24 Sept 2025, Last Modified: 26 Dec 2025
Venue: NeurIPS2025-AI4Science Poster
Readers: Everyone
License: CC BY 4.0
Additional Submission Instructions: For the camera-ready version, please include the author names and affiliations, funding disclosures, and acknowledgements.
Track: Track 2: Dataset Proposal Competition
Keywords: open urban data, schema harmonization, large language models, retrieval-augmented generation
TL;DR: We propose OpenCityCorpus, a novel, large‑scale (~200 GB) dataset of harmonized urban data aggregated from over 200 cities for AI training and retrieval‑augmented generation.
Abstract: We propose $\textit{OpenCityCorpus}$, an openly shareable, large-scale corpus that harmonizes public urban data from 200+ cities across Socrata, ArcGIS, and CKAN portals into a unified schema and an LLM-ready text representation. Fragmentation across municipal platforms has long impeded rigorous, cross-city science on climate, mobility, governance, and public health. Our dataset resolves schema heterogeneity, standardizes types and coordinate systems, and converts rows into semantically consistent factual statements, enabling retrieval-augmented generation, hypothesis testing, and transfer learning. The resource targets three AI-for-Science tasks: cross-domain scientific reasoning over coupled urban systems, surrogate modeling that complements physics-based simulators, and robust evaluation of tool-augmented LLM agents. We detail a feasible, privacy-preserving data-creation pathway, outline cost- and scale-aware operations for continuous refresh, and describe benchmarks designed to expose both the reach and the limits of current AI methods. By turning fragmented open portals into a single scientific substrate, $\textit{OpenCityCorpus}$ lowers barriers to high-impact, reproducible discovery.
Submission Number: 461
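
The abstract's harmonization step (mapping portal-specific columns into a unified schema, standardizing types, and rendering each row as a factual statement) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the field mappings in `FIELD_MAPS`, the `Record` schema, the helper names `harmonize` and `to_statement`, and the statement template are hypothetical and are not taken from the proposal. Coordinate reprojection is omitted; the sketch assumes inputs are already latitude/longitude.

```python
from dataclasses import dataclass
from typing import Any, Mapping

# Hypothetical per-portal column mappings (source column -> unified column).
# Real portals expose many more fields; these names are illustrative only.
FIELD_MAPS = {
    "socrata": {
        "created_date": "timestamp",
        "complaint_type": "category",
        "latitude": "lat",
        "longitude": "lon",
    },
    "ckan": {
        "date_opened": "timestamp",
        "request_type": "category",
        "y": "lat",
        "x": "lon",
    },
}


@dataclass(frozen=True)
class Record:
    """Assumed unified schema for a single harmonized row."""
    city: str
    timestamp: str
    category: str
    lat: float
    lon: float


def harmonize(portal: str, city: str, row: Mapping[str, Any]) -> Record:
    """Rename portal-specific columns and coerce values to unified types."""
    mapping = FIELD_MAPS[portal]
    unified = {target: row[source] for source, target in mapping.items()}
    return Record(
        city=city,
        timestamp=str(unified["timestamp"]),
        category=str(unified["category"]).strip().lower(),
        lat=float(unified["lat"]),
        lon=float(unified["lon"]),
    )


def to_statement(rec: Record) -> str:
    """Render a harmonized record as one plain-text factual statement."""
    return (
        f"In {rec.city}, a '{rec.category}' record was logged at "
        f"{rec.timestamp} at latitude {rec.lat:.4f}, longitude {rec.lon:.4f}."
    )
```

Under these assumptions, two rows that describe the same event but arrive from different portals with different column names and value types yield the same statement, which is the property a retrieval index over the corpus would rely on.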