Keywords: Scene graph generation, Open-vocabulary learning, Spatial reasoning
TL;DR: This paper introduces SPOT, a novel framework that uses structured prompts and object-centric visual features to achieve superior spatial reasoning for scene graph generation.
Abstract: Scene graphs provide a compact and structured representation of visual scenes by capturing objects and their relationships, making them valuable for downstream tasks in vision-language reasoning and robotics. While early work focused on closed-vocabulary settings, newer efforts have shifted toward open-world scene graph generation (SGG) to better handle diverse real-world scenarios. Recent works explore leveraging VLMs and LLMs in open-world settings for their broad, open-vocabulary knowledge. However, existing approaches often rely on proprietary models like GPT-4o and are limited by the unstructured output behavior and weak spatial and object-level reasoning capabilities of pretrained models. We introduce SPOT, a structured prompting framework that augments open-source VLMs with spatial reasoning abilities for scene graph generation with minimal training. By combining object-centric visual features with the model’s knowledge priors, SPOT achieves competitive or superior relation prediction compared to large proprietary models. Additionally, SPOT demonstrates strong cross-domain generalization, including extension to 3D scenes. Our approach is built upon open-source models, offering a scalable and accessible framework for harnessing VLMs for SGG.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4145