PG-VLM: A Multi-Stage Panoptic-Graph Architecture for Detailed Visual-Linguistic Grounding in Urban Scenes
Keywords: panoptic scene graph, vision-language model, semantic triplets, paragraph generation, explainable AI, spatial reasoning, hallucination reduction, urban scene understanding, NRDS
TL;DR: PG-VLM generates spatially grounded paragraph-level descriptions of urban scenes using a panoptic scene graph, semantic triplets, and a structured-to-text generator, improving both accuracy and faithfulness over recent VLMs.
Abstract: Describing complex urban scenes with coherent paragraphs that are both semantically rich and spatially grounded is a key challenge for vision–language research. We present PG-VLM, a modular framework that (i) builds a Hierarchical Panoptic Scene Graph (HPSG) from panoptic segmentation, (ii) distills the graph into semantic triplets using a local instruction model, and (iii) generates narratives with a structured-to-text T5 generator. We assess text quality with standard captioning metrics and grounding with a new Narrative Relevance Detection Score (NRDS) that ties detection correctness to textual mention quality. On Cityscapes, PG-VLM surpasses recent vision–language baselines (BLIP-2, LLaVA-1.5 7B, SpatialVLM) across all metrics: CIDEr 135.0 (vs. 88.0/104.5/118.2), SPICE 28.8 (vs. 19.5/21.2/23.6), and BERTScore-F1 92.5 (vs. 88.0/89.0/90.1). Hallucination is reduced, with CHAIR-s 7.2 and CHAIR-i 9.5 (vs. 16.8/20.5 for BLIP-2, 13.0/16.2 for LLaVA-1.5, 11.4/14.8 for SpatialVLM). PG-VLM also achieves substantially stronger grounding, with NRDS 0.76 versus 0.52 for BLIP-2. A zero-shot check on BDD100K (50 images) indicates cross-dataset generalization (CIDEr 108.4, SPICE 24.1, NRDS-ZS 0.68), maintaining margins over all baselines. These results show that enforcing a symbolic bottleneck (HPSG to triplets) before generation improves both descriptive quality and faithfulness, offering a reproducible and extensible route to interpretable visual–language grounding in urban scenes.
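The "symbolic bottleneck" idea in the abstract — converting detected instances into (subject, predicate, object) triplets before any text is generated — can be illustrated with a minimal sketch. This is a hypothetical toy, not the authors' implementation: the names (`Instance`, `extract_triplets`, `linearize`) and the left/right heuristic are illustrative assumptions, standing in for the HPSG construction and instruction-model distillation described in the paper.

```python
# Hypothetical sketch of the symbolic bottleneck: instances -> triplets -> linearized
# string for a seq2seq generator. All names and heuristics are illustrative only.
from dataclasses import dataclass

@dataclass
class Instance:
    label: str
    box: tuple  # (x1, y1, x2, y2) in pixels

def spatial_relation(a: Instance, b: Instance) -> str:
    """Coarse left/right predicate from box centers (toy stand-in for real spatial reasoning)."""
    ax = (a.box[0] + a.box[2]) / 2
    bx = (b.box[0] + b.box[2]) / 2
    return "left of" if ax < bx else "right of"

def extract_triplets(instances):
    """Pairwise (subject, predicate, object) triplets over all instance pairs."""
    trips = []
    for i, a in enumerate(instances):
        for b in instances[i + 1:]:
            trips.append((a.label, spatial_relation(a, b), b.label))
    return trips

def linearize(triplets):
    """Serialize triplets into a flat string a structured-to-text model could consume."""
    return " ; ".join(f"{s} | {p} | {o}" for s, p, o in triplets)

scene = [Instance("car", (10, 40, 60, 80)), Instance("pedestrian", (100, 30, 120, 90))]
print(linearize(extract_triplets(scene)))  # car | left of | pedestrian
```

The generator then only ever sees the linearized triplets, so every relation it can verbalize traces back to a detected instance pair — the property the abstract credits for the reduced CHAIR hallucination scores.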
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24354