Keywords: scene graph generation, visual relationship detection, object hierarchy
Abstract: A visual relationship, commonly defined as a tuple of subject, predicate, and object, plays an important role in visual scene understanding. Most existing works are dedicated to generating discriminative predicate representations for the detected objects based on their appearances, relative positions, and global context. However, these global representations are inherently ambiguous and confounded, often capturing irrelevant contextual information. To address this problem, we propose to leverage object hierarchy to infer visual relationships. Our core insight is that a seemingly holistic object-level interaction can be resolved into a set of precise part-level interactions via an object hierarchy. Compared with object-level interactions, part-level interactions not only have lower visual variability but also provide accurate guidance for the model's understanding of predicates. To this end, we introduce the Hierarchical Inference Network (HINet). Specifically, we first construct more robust and discriminative predicate representations by dynamically fusing global object-level and local part-level representations. We then design a structured constraint on predicate representations by explicitly constructing correlations between object-level and part-level interactions, thereby guiding the model to focus on the key information of the current interaction.
By combining these strategies, HINet moves beyond superficially learning visual relations from objects and predicates and instead adopts a structured reasoning approach to capture their essence. Experiments demonstrate the effectiveness of our method. Furthermore, it is versatile and can be efficiently integrated with various existing models to enhance their performance; for instance, this integration yields a 1.8\% $\sim$ 6.8\% increase in mR@100 on the SGGen task.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6452