Efficient Image-Text Retrieval via Keyword-Guided Pre-Screening

Published: 01 Jan 2024, Last Modified: 15 May 2025. IEEE Trans. Circuits Syst. Video Technol. 2024. License: CC BY-SA 4.0
Abstract: Image-text retrieval is a fundamental task that models the connection between images and natural language. Despite rapid progress in retrieval performance, most current methods suffer from time complexity that grows with the gallery size $N$, which hinders their practical application. Targeting efficiency, we propose a simple and effective keyword-guided pre-screening framework for image-text retrieval. Specifically, we convert the image and text data into keywords and perform keyword matching across the modalities to exclude a large number of irrelevant gallery samples before the retrieval network is invoked. We cast keyword prediction as a multi-label classification problem and propose a multi-task learning scheme that appends multi-label classifiers to the image-text retrieval network, achieving lightweight and high-performance keyword prediction. For keyword matching, we introduce the inverted index from search engines, which benefits both the time and space complexity of the pre-screening. Extensive experiments on two widely used datasets, i.e., Flickr30K and MS-COCO, verify the effectiveness of the proposed framework. Equipped with only two embedding layers, the framework achieves $O(1)$ querying time complexity and improves retrieval efficiency while maintaining performance when applied prior to common image-text retrieval methods.
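To make the pre-screening idea concrete, below is a minimal Python sketch of keyword matching with an inverted index: gallery samples are indexed by their predicted keyword ids, and a query retrieves only the gallery entries that share at least one keyword, which are then passed to the retrieval network. This is an illustration under assumed inputs, not the authors' implementation; the keyword ids, function names, and example data are hypothetical, and the multi-label classifiers that would produce the keywords are omitted.

```python
from collections import defaultdict

def build_inverted_index(gallery_keywords):
    """Map each keyword id to the set of gallery indices whose samples contain it."""
    index = defaultdict(set)
    for gallery_idx, keywords in enumerate(gallery_keywords):
        for kw in keywords:
            index[kw].add(gallery_idx)
    return index

def pre_screen(query_keywords, index):
    """Return gallery indices sharing at least one keyword with the query.

    Only these candidates need to be scored by the (expensive) retrieval network.
    """
    candidates = set()
    for kw in query_keywords:
        candidates |= index.get(kw, set())
    return candidates

# Hypothetical keyword sets predicted for four gallery images.
gallery = [{1, 4, 7}, {2, 3}, {4, 9}, {5, 6}]
index = build_inverted_index(gallery)

# A text query whose predicted keywords are {4, 5}: only galleries 0, 2, 3 survive.
print(pre_screen({4, 5}, index))
```

Because lookups touch only the posting lists of the query's keywords rather than the full gallery, the per-query cost of this step is independent of the gallery size, which is consistent with the $O(1)$ querying complexity claimed for the pre-screening stage.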