DPA-SGG: Dual Prompt Learning with Pseudo-Visual Augmentation for Open-Vocabulary Scene Graph Generation
Keywords: OVSGG, LLMs, Dual Prompt Learning, Pseudo-Visual Augmentation
Abstract: Open-Vocabulary Scene Graph Generation (OVSGG) aims to recognize previously unseen relationships between objects in images, which is essential for robust visual understanding in dynamic real-world scenarios. Recent methods leverage prompt tuning to transfer the rich visual–semantic knowledge of pretrained Vision-Language Models (VLMs), thereby improving the ability to recognize unseen predicates.
Typically, these methods rely solely on subject and object bounding boxes from seen relationships to extract visual features that guide visual–semantic alignment during prompt learning. However, this paradigm suffers from two major limitations: 1) Contextual Blindness: by attending only to object regions and excluding their union region, models overlook broader contextual cues and struggle to distinguish triplets that are visually similar but semantically distinct; 2) Limited Visual Generalization: because training is restricted to annotated visual regions, models transfer poorly to unseen predicates.
To address these limitations, we propose a novel OVSGG framework, termed DPA-SGG, consisting of two key components: Dual Prompt Learning (DPL), which introduces two complementary prompts to jointly capture localized object cues and global scene context, better distinguishing visually similar relationships; and Pseudo-Visual Augmentation (PVA), which enriches visual diversity by generating a corpus of textual scenes in place of costly visual annotations. Extensive experiments and ablation studies demonstrate the effectiveness of the proposed framework.
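To make the dual-prompt idea concrete, the following is a minimal PyTorch sketch of one plausible instantiation, not the authors' implementation: two learnable prompt contexts score predicate classes against localized object-pair features and union-region (scene) features, respectively, and the two scores are fused. Names such as `DualPromptHead`, `obj_feat`, and `union_feat` are hypothetical, and the class embeddings stand in for a frozen VLM text encoder.

```python
# Illustrative sketch only; assumes precomputed visual features for the
# subject-object pair and their union region.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPromptHead(nn.Module):
    """Scores predicate classes with two learnable prompt contexts:
    one aligned to object-pair features, one to the union (scene) region."""
    def __init__(self, num_predicates: int, embed_dim: int = 512, ctx_len: int = 8):
        super().__init__()
        # Learnable context tokens for the "object" and "scene" prompts.
        self.obj_ctx = nn.Parameter(torch.randn(ctx_len, embed_dim) * 0.02)
        self.scene_ctx = nn.Parameter(torch.randn(ctx_len, embed_dim) * 0.02)
        # Placeholder class-name embeddings; in practice these would come
        # from the frozen VLM text encoder.
        self.register_buffer("cls_embed", torch.randn(num_predicates, embed_dim))
        self.pool = nn.Linear(embed_dim, embed_dim)

    def _prompt_embed(self, ctx: torch.Tensor) -> torch.Tensor:
        # Concatenate context tokens with each class embedding and mean-pool,
        # standing in for encoding the full prompt with a text encoder.
        prompts = torch.cat(
            [ctx.unsqueeze(0).expand(self.cls_embed.size(0), -1, -1),
             self.cls_embed.unsqueeze(1)], dim=1)
        return F.normalize(self.pool(prompts.mean(dim=1)), dim=-1)

    def forward(self, obj_feat: torch.Tensor, union_feat: torch.Tensor) -> torch.Tensor:
        # obj_feat / union_feat: (batch, embed_dim) visual features for the
        # subject-object pair and their union region, respectively.
        obj_logits = F.normalize(obj_feat, dim=-1) @ self._prompt_embed(self.obj_ctx).T
        scene_logits = F.normalize(union_feat, dim=-1) @ self._prompt_embed(self.scene_ctx).T
        # Fuse localized and contextual evidence (simple average here).
        return (obj_logits + scene_logits) / 2

# Usage: score 50 predicate classes for a batch of 4 subject-object pairs.
head = DualPromptHead(num_predicates=50)
scores = head(torch.randn(4, 512), torch.randn(4, 512))
print(scores.shape)  # torch.Size([4, 50])
```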
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7674