Hi-SIGIR: Hierarchical Semantic-Guided Image-to-Image Retrieval via Scene Graph

Published: 27 Oct 2023, Last Modified: 05 Feb 2024 · OpenReview Archive Direct Upload · CC BY 4.0
Abstract: Image-to-image retrieval, a fundamental task, aims to match similar images given a query image. Existing methods based on convolutional neural networks are usually sensitive to low-level visual features and ignore high-level semantic relationship information, which makes retrieving complicated images with multiple objects and various relationships a significant challenge. Although some works introduce the scene graph to capture the global semantic features of objects and their relations, they ignore local visual representations. In addition, because single-modality representations are fragile, poisoning attacks are easily mounted in adversarial scenarios, hurting the robustness of visually guided image retrieval models. To overcome these issues, we propose a novel hierarchical semantic-guided image-to-image retrieval method via scene graph, called Hi-SIGIR. Specifically, our method first generates the scene graph of an image. Then, it extracts and learns both the visual and semantic features of the nodes and relations within the scene graph. Next, these features are fused to obtain local information and fed to a graph neural network to obtain global information. Using this information, the similarity between the scene graphs of candidate images is calculated at both the local and global levels to perform retrieval. Finally, we introduce a surrogate that calculates relevance in a cross-modal manner to better understand image content. Experimental evaluations on several widely used benchmarks demonstrate the superiority of the proposed method.
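The retrieval pipeline described above (fuse per-node visual and semantic features for local information, run a graph neural network for global information, then score candidates at both levels) could be sketched roughly as follows. This is a minimal illustration, not the paper's actual design: the concatenation fusion, the single mean-aggregation GNN layer, the best-match local scoring, and the equal local/global weighting are all assumptions made here for concreteness.

```python
import numpy as np

def fuse_local(visual, semantic):
    # Fuse per-node visual and semantic features; concatenation is an
    # illustrative stand-in for the paper's (unspecified) fusion operator.
    return np.concatenate([visual, semantic], axis=-1)

def gnn_layer(node_feats, adj):
    # One message-passing step: mean-aggregate neighbor features, standing in
    # for the GNN that produces global information from the scene graph.
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    return np.tanh(node_feats + (adj @ node_feats) / deg)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def graph_similarity(feats_q, adj_q, feats_c, adj_c):
    # Local level: match each query node to its best-scoring candidate node.
    local = np.mean([max(cosine(q, c) for c in feats_c) for q in feats_q])
    # Global level: cosine between mean-pooled GNN graph embeddings.
    g_q = gnn_layer(feats_q, adj_q).mean(axis=0)
    g_c = gnn_layer(feats_c, adj_c).mean(axis=0)
    # Equal weighting of the two levels is an arbitrary choice here.
    return 0.5 * local + 0.5 * cosine(g_q, g_c)
```

In use, candidate images would be ranked by `graph_similarity` between their scene-graph features and the query's; an identical graph scores (near) 1.0, and mismatched node features lower both the local and global terms.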