Abstract: Open-vocabulary Scene Graph Generation (OV-SGG) overcomes the limitations of the closed-set assumption by aligning visual relationship representations with open-vocabulary textual representations. This enables the identification of novel visual relationships, making it applicable to real-world scenarios with diverse relationships. However, existing OV-SGG methods are constrained by fixed text representations, limiting diversity and accuracy in image-text alignment. To address these challenges, we propose the Relation-Aware Hierarchical Prompting (RAHP) framework, which enhances text representation by integrating subject-object and region-specific relation information. Our approach employs entity clustering to manage the complexity of the relation triplet category space, ensuring the practicality of incorporating subject-object information. Additionally, we utilize a large language model (LLM) to generate detailed region-aware prompts, capturing fine-grained visual interactions and improving alignment between visual and textual modalities. RAHP also introduces a dynamic selection mechanism within Vision-Language Models (VLMs), which adaptively selects relevant text prompts based on the visual content, reducing noise from irrelevant prompts. Extensive experiments on the Visual Genome and Open Images v6 datasets show that our framework consistently achieves state-of-the-art performance, demonstrating its effectiveness in addressing the challenges of open-vocabulary scene graph generation.