TL;DR: We present SCENIR, an unsupervised scene graph retrieval framework based on Graph Autoencoders, to efficiently tackle semantic image similarity.
Abstract: Despite the dominance of convolutional and transformer-based architectures in image-to-image retrieval, these models are prone to biases arising from low-level visual features, such as color. Recognizing the lack of semantic understanding as a key limitation, we propose a novel scene graph-based retrieval framework that emphasizes semantic content over superficial image characteristics. Prior approaches to scene graph retrieval predominantly rely on supervised Graph Neural Networks (GNNs), which require ground truth graph pairs derived from image captions. However, the inconsistency of caption-based supervision, stemming from variable text encodings, undermines retrieval reliability.
To address these shortcomings, we present *SCENIR*, a Graph Autoencoder-based unsupervised retrieval framework, which eliminates the dependence on labeled training data. Our model demonstrates superior performance in both retrieval metrics and runtime efficiency, outperforming existing vision-based, multimodal, and supervised GNN approaches.
We further advocate for *Graph Edit Distance* (GED) as a deterministic and robust ground truth measure for scene graph similarity, replacing the inconsistent caption-based alternatives for the first time in image-to-image retrieval evaluation.
Finally, we validate the generalizability of our method by applying it to unannotated datasets via automated scene graph generation, while substantially contributing to advancing the state of the art in counterfactual image retrieval.
The source code is available at https://github.com/nickhaidos/scenir-icml2025.
Lay Summary: State-of-the-art AI models often fail to grasp the true meaning of images, focusing on surface-level details like color, which leads to biased and inaccurate image search results. Many systems also need extensive, manually labeled data to learn, a slow and costly process.
Our proposed system, SCENIR, teaches computers to "see" more deeply by utilizing "scene graphs" – structured maps of objects and their relationships within an image. Scene graphs enable SCENIR to focus on the key visual information within an image – the content that is most relevant and aligned with human understanding. SCENIR uniquely learns without needing pre-labeled examples (unsupervised learning), making it more efficient. We also propose a more reliable evaluation method (Graph Edit Distance) for this specific task.
SCENIR delivers more accurate and less biased image search by understanding semantic content. It also runs faster than previous methods, even on unannotated, in-the-wild images.
Link To Code: https://github.com/nickhaidos/scenir-icml2025
Primary Area: Deep Learning->Graph Neural Networks
Keywords: Scene Graph Retrieval, Unsupervised Graph Autoencoders, Visual Semantic Similarity
Submission Number: 12072