SGG-MVAR: Cross-Modal Retrieval With Scene Graph Generation and Multiview Attribute Relationship Guidance
Abstract: Cross-modal retrieval is crucial for accurate and efficient information retrieval, as it establishes semantic correlations between heterogeneous images and text. However, traditional image-text training sets suffer from information asymmetry, manifested in short captions and limited sentence structures, which often leaves essential visual information under-represented. We introduce RichDataset, a semantically rich collection of diverse real-life image-text pairs and AI-generated content spanning domains such as news, entertainment, education, and posters. Compared with classic benchmarks such as Flickr30k and MS-COCO, RichDataset exhibits a novel and balanced distribution. Existing cross-modal retrieval models struggle to extract distinct features from this emerging data, leading to low retrieval accuracy. We propose SGG-MVAR, a comprehensive retrieval model guided by multiview scene information and semantic relationships. Leveraging a scene knowledge database, our model parses scene graphs and identifies differences in attributes and relationships. We conduct extensive experiments to evaluate the proposed dataset and model, and the results consistently demonstrate significant recall improvements for cross-modal retrieval.
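The abstract does not detail how the parsed scene graphs guide retrieval, so the following is only a minimal illustrative sketch of one way multiview scene-graph matching could score an image-text pair. All names (the `objects`/`attributes`/`relations` views, the weights, and the helper functions) are assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the paper's method): score an image-text pair by
# combining similarities over three scene-graph views: objects, attributes,
# and relationships. Each view holds a list of embedding vectors.
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def view_score(img_items, txt_items):
    # For each text element, take its best-matching image element, then average.
    if not img_items or not txt_items:
        return 0.0
    return float(np.mean([max(cosine(t, i) for i in img_items) for t in txt_items]))

def graph_similarity(image_graph, text_graph, w_obj=0.5, w_attr=0.3, w_rel=0.2):
    # Weighted combination of the three views; weights are arbitrary here.
    return (w_obj * view_score(image_graph["objects"], text_graph["objects"])
            + w_attr * view_score(image_graph["attributes"], text_graph["attributes"])
            + w_rel * view_score(image_graph["relations"], text_graph["relations"]))
```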
External IDs: dblp:journals/tcss/WangZYST25