Keywords: scene graph generation, spatial reasoning
Abstract: Scene graphs provide a compact and structured representation of visual scenes by capturing objects and their relationships, making them valuable for downstream tasks in vision-language reasoning and robotics. While early work focused on closed-vocabulary settings, newer efforts have shifted toward open-world scene graph generation (SGG) to better handle diverse real-world scenarios. Recent works explore leveraging vision-language models (VLMs) and large language models (LLMs) in open-world settings for their broad, open-vocabulary knowledge. However, existing approaches often rely on proprietary models like GPT-4o and are limited by the unstructured output behavior and weak spatial and object-level reasoning capabilities of pretrained models. We introduce \ours, a structured prompting framework that augments open-source VLMs with spatial reasoning abilities for scene graph generation with minimal training. By combining object-centric visual features with the model's knowledge priors, \ours achieves relation prediction that is competitive with or superior to large proprietary models. Additionally, \ours demonstrates strong cross-domain generalization, including extension to 3D scenes. Our approach is built upon open-source models, offering a scalable and accessible framework for harnessing VLMs for SGG.
Supplementary Material: pdf
Previously Accepted: No
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 15