SPOT: Structured Prompting with Object-centric Tokens for open-world scene graphs

11 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Scene graph generation, Open-vocabulary learning, Spatial reasoning
TL;DR: This paper introduces SPOT, a novel framework that uses structured prompts and object-centric visual features to achieve superior spatial reasoning for scene graph generation.
Abstract: Scene graphs provide a compact and structured representation of visual scenes by capturing objects and their relationships, making them valuable for downstream tasks in vision-language reasoning and robotics. While early work focused on closed-vocabulary settings, newer efforts have shifted toward open-world scene graph generation (SGG) to better handle diverse real-world scenarios. Recent work explores leveraging VLMs and LLMs in open-world settings for their broad, open-vocabulary knowledge. However, existing approaches often rely on proprietary models like GPT-4o and are limited by the unstructured output behavior and weak spatial and object-level reasoning capabilities of pretrained models. We introduce SPOT, a structured prompting framework that augments open-source VLMs with spatial reasoning abilities for scene graph generation with minimal training. By combining object-centric visual features with the model's knowledge priors, SPOT achieves competitive or superior relation prediction compared to large proprietary models. Additionally, SPOT demonstrates strong cross-domain generalization, including extension to 3D scenes. Our approach is built upon open-source models, offering a scalable and accessible framework for harnessing VLMs for SGG.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4145