Disentangled OCR: A More Granular Information for "Text"-to-Image Retrieval

Xinyu Zhou, Shilin Li, Huen Chen, Anna Zhu

2022 (modified: 25 Apr 2023), PRCV (1) 2022. Readers: Everyone

Abstract: Most previous text-to-image retrieval methods are based on semantic matching between text and image, either locally or globally. However, they ignore a very important element in both text and image, i.e., the OCR information. In this paper, we present a novel approach that disentangles the OCR information from both text and image and uses the disentangled information from the two modalities for matching. The matching score consists of two parts: the traditional global semantic text-to-image representation matching score and the OCR matching score. Since no dataset supports training for the text OCR disentangling task, we label a useful subset of the TextCaps dataset, which contains scene text images and their corresponding captions. We relabel the words of the captions as OCR and non-OCR words. In total, we extract from TextCaps 110K captions and 22K images that contain OCR information. We call this dataset TextCaps-OCR. Experiments on TextCaps-OCR and another public dataset, CTC (COCO-Text Captions), demonstrate the effectiveness of disentangling OCR in text and image for the cross-modality retrieval task.
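
The two-part matching score described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the two parts are combined or how the OCR score is computed, so everything below is an assumption made for illustration: the weighted sum, the weight `alpha`, the set-overlap OCR score, and all function names are hypothetical and are not the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def ocr_overlap(caption_ocr_words, image_ocr_tokens):
    """Fraction of the caption's OCR words that also appear among the image's OCR tokens.

    Assumption: a case-insensitive set overlap stands in for the OCR matching score.
    """
    cap = {w.lower() for w in caption_ocr_words}
    img = {t.lower() for t in image_ocr_tokens}
    if not cap:
        return 0.0
    return len(cap & img) / len(cap)


def matching_score(text_emb, image_emb, caption_ocr_words, image_ocr_tokens, alpha=0.5):
    """Combine a global semantic score with an OCR score.

    Assumption: a convex combination with weight `alpha` on the OCR part.
    """
    semantic = cosine(text_emb, image_emb)
    ocr = ocr_overlap(caption_ocr_words, image_ocr_tokens)
    return (1 - alpha) * semantic + alpha * ocr
```

In this sketch, a caption whose OCR words all appear in the image's detected text receives the full OCR term even when the global embeddings are dissimilar, which is the kind of finer-grained signal the abstract argues global matching alone misses.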
