Scene-Aware Urban Design: A Human–AI Recommendation Framework Using Co-Occurrence Embeddings and Vision-Language Models
Track: Paper
Keywords: urban design, participatory design, bottom-up urbanism, human-AI collaboration, human-computer interaction (HCI), vision-language models (VLMs), scene understanding, co-occurrence embeddings, open-vocabulary object detection, Grounding DINO, ADE20K, zero-shot detection, augmented reality (AR), anchor-object selection, recommendation framework, text-to-3D generation, mesh generation for AR, micro-scale interventions, civic technology, civic reporting platforms
TL;DR: Given a user-selected anchor object, co-occurrence statistics and a vision-language model suggest small, context-aware urban tactics and preview them in AR.
Abstract: This paper introduces a human-in-the-loop computer vision framework that uses generative AI to propose micro-scale design interventions in public space and support more continuous, local participation. Using Grounding DINO and a curated subset of the ADE20K dataset as a proxy for the urban built environment, the system detects urban objects and builds co-occurrence embeddings that reveal common spatial configurations. From this analysis, the user receives the five statistically most likely complements to a chosen anchor object. A vision-language model then reasons over the scene image and the selected pair to suggest a third object that completes a more complex urban tactic. The workflow keeps people in control of selection and refinement and aims to move beyond top-down master planning by grounding choices in everyday patterns and lived experience.
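As a minimal sketch of the co-occurrence step described in the abstract (not the authors' implementation), the snippet below assumes per-scene object labels have already been extracted, e.g. by running Grounding DINO over ADE20K scenes; the function names and toy label sets are illustrative only. It counts how often labels appear together across scenes and returns the five most frequent complements of a chosen anchor object.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_cooccurrence(detections_per_image):
    """detections_per_image: list of label sets, one per scene image."""
    counts = defaultdict(Counter)
    for labels in detections_per_image:
        # Count each unordered label pair once per scene, symmetrically.
        for a, b in combinations(sorted(set(labels)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def top_complements(counts, anchor, k=5):
    """Return the k labels most often seen alongside the anchor,
    with their normalized co-occurrence frequencies."""
    total = sum(counts[anchor].values()) or 1
    return [(label, n / total) for label, n in counts[anchor].most_common(k)]

# Hypothetical detections, standing in for Grounding DINO output on ADE20K.
scenes = [
    {"bench", "tree", "streetlight"},
    {"bench", "tree", "trash can"},
    {"bench", "planter", "tree"},
]
print(top_complements(build_cooccurrence(scenes), "bench"))
# e.g. [('tree', 0.5), ('streetlight', 0.17), ...]
```

The normalized counts here play the role of the co-occurrence embedding rows; the selected anchor-complement pair would then be passed, with the scene image, to the vision-language model for the third-object suggestion.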
Submission Number: 221