Relation-aware Hierarchical Prompt for Open-vocabulary Scene Graph Generation

Published: 25 Mar 2025, Last Modified: 05 Mar 2025 | AAAI 2025 | Everyone | CC BY 4.0
Abstract: Open-vocabulary Scene Graph Generation (OV-SGG) overcomes the limitations of the closed-set assumption by aligning visual relationship representations with open-vocabulary textual representations. This enables the identification of novel visual relationships, making it applicable to real-world scenarios with diverse relationships. However, existing OV-SGG methods are constrained by fixed text representations, which limits diversity and accuracy in image-text alignment. To address these challenges, we propose the Relation-Aware Hierarchical Prompting (RAHP) framework, which enhances text representation by integrating subject-object and region-specific relation information. Our approach employs entity clustering to manage the complexity of the relation triplet category space, ensuring the practicality of incorporating subject-object information. Additionally, we utilize a large language model (LLM) to generate detailed region-aware prompts, capturing fine-grained visual interactions and improving alignment between visual and textual modalities. RAHP also introduces a dynamic selection mechanism within Vision-Language Models (VLMs), which adaptively selects relevant text prompts based on the visual content, reducing noise from irrelevant prompts. Extensive experiments on the Visual Genome and Open Images v6 datasets show that our framework consistently achieves state-of-the-art performance, confirming its effectiveness in addressing the challenges of open-vocabulary scene graph generation.
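The dynamic selection idea in the abstract can be illustrated with a small sketch. The abstract does not specify how prompts are selected, so the following is only one generic possibility and not the RAHP implementation: it assumes each candidate text prompt and the visual region are already embedded as plain vectors, ranks prompts by cosine similarity to the visual embedding, keeps the top k, and averages their similarities as a predicate score. The names `cosine`, `select_prompts`, and the parameter `k` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_prompts(visual_emb, prompt_embs, k=2):
    """Keep the k prompts most similar to the visual embedding.

    Assumption: top-k cosine ranking stands in for the paper's selection
    mechanism, which the abstract does not describe in detail.
    Returns (indices of selected prompts, mean similarity of the selection).
    """
    if not prompt_embs or k < 1:
        raise ValueError("need at least one prompt and k >= 1")
    sims = [cosine(visual_emb, p) for p in prompt_embs]
    ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    top = ranked[:k]
    score = sum(sims[i] for i in top) / len(top)
    return top, score


# Toy data: a 2-D "visual" embedding and three candidate prompt embeddings.
visual = [1.0, 0.0]
prompts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
selected, score = select_prompts(visual, prompts, k=2)
```

In this toy setup the prompt orthogonal to the visual embedding is dropped, which mirrors the stated goal of reducing noise from irrelevant prompts; a real system would obtain the embeddings from a VLM's image and text encoders.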