Scene-Aware Urban Design: A Human–AI Recommendation Framework Using Co-Occurrence Embeddings and Vision-Language Models
Track: Paper
Keywords: urban design, participatory design, bottom-up urbanism, human-AI collaboration, human-computer interaction (HCI), vision-language models (VLMs), scene understanding, co-occurrence embeddings, open-vocabulary object detection, Grounding DINO, ADE20K, zero-shot detection, augmented reality (AR), anchor-object selection, recommendation framework, text-to-3D generation, mesh generation for AR, micro-scale interventions, civic technology, civic reporting platforms
TL;DR: Given a user-selected anchor object, co-occurrence statistics and a vision-language model suggest small, context-aware urban tactics and preview them in AR.
Abstract: This paper introduces a human-in-the-loop computer vision framework that uses generative AI to propose micro-scale design interventions in public space and support more continuous, local participation. Using Grounding DINO and a curated subset of the ADE20K dataset as a proxy for the urban built environment, the system detects urban objects and builds co-occurrence embeddings that reveal common spatial configurations. From this analysis, the user receives the five statistically most likely complements to a chosen anchor object. A vision-language model then reasons over the scene image and the selected pair to suggest a third object that completes a more complex urban tactic. The workflow keeps people in control of selection and refinement and aims to move beyond top-down master planning by grounding choices in everyday patterns and lived experience.
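As a minimal sketch of the co-occurrence step described in the abstract (not the authors' implementation), the snippet below assumes per-scene object labels have already been extracted, e.g. by running Grounding DINO over ADE20K scenes; the function names and toy label sets are illustrative only. It counts how often labels appear together across scenes and returns the five most frequent complements of a chosen anchor object.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_cooccurrence(detections_per_image):
    """detections_per_image: list of label sets, one per scene image."""
    counts = defaultdict(Counter)
    for labels in detections_per_image:
        # Count each unordered label pair once per scene, symmetrically.
        for a, b in combinations(sorted(set(labels)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def top_complements(counts, anchor, k=5):
    """Return the k labels most often seen alongside the anchor,
    with their normalized co-occurrence frequencies."""
    total = sum(counts[anchor].values()) or 1
    return [(label, n / total) for label, n in counts[anchor].most_common(k)]

# Hypothetical detections, standing in for Grounding DINO output on ADE20K.
scenes = [
    {"bench", "tree", "streetlight"},
    {"bench", "tree", "trash can"},
    {"bench", "planter", "tree"},
]
print(top_complements(build_cooccurrence(scenes), "bench"))
# e.g. [('tree', 0.5), ('streetlight', 0.17), ...]
```

The normalized counts here play the role of the co-occurrence embedding rows; the selected anchor-complement pair would then be passed, with the scene image, to the vision-language model for the third-object suggestion.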
Submission Number: 221