Scene Text Involved "Text"-to-Image Retrieval through Logically Hierarchical Matching

Published: 01 Jan 2023, Last Modified: 08 Oct 2023, ICME 2023, Readers: Everyone
Abstract: Text-to-image retrieval, one of the most important cross-modality tasks, aims to retrieve the images most relevant to a given text query. Most recent approaches rely on large-scale models, whose high time cost makes real-time search impractical. They also ignore fine-grained information, namely scene text. To tackle these issues, we propose a novel matching method that considers scene text in both modalities and matches quickly by aligning objects first, then relations, and finally the global level. This logically hierarchical process emulates the way humans understand information. To better implement our method, we relabel the TextCaps-OCR dataset, which contains 110K captions with word-level POS labeling and 22K corresponding scene-text images with bounding boxes. Extensive experiments demonstrate the superiority and efficiency of our method, which significantly outperforms the previous SOTA on both OCR-contained and OCR-free datasets.