Parallel weight control based on policy gradient of relation refinement for cross-modal retrieval

Li Zhang, Yahu Yang, Shuheng Ge, Guanghui Sun, Xiangqian Wu

Published: 01 Jan 2024, Last Modified: 13 Nov 2024, Eng. Appl. Artif. Intell. 2024, CC BY-SA 4.0

Abstract: Cross-modal retrieval has become a hot topic in multi-modal research and has received widespread attention. Existing approaches build the global feature vector of an image (text) from object region (word) features either without weighting or with an unsupervised attention mechanism, which leads to incorrect attention or weighting of the object region (word) features. In this paper, a parallel weight control method based on Policy Gradient of Relationship Refinement (PGRR) is proposed for cross-modal retrieval. PGRR uses a self-attention mechanism to model the relationship between each local feature and all local features within the same modality, and then uses discrete and continuous policy gradients to estimate more accurately the weight of that local feature in the final global feature. Furthermore, PGRR replaces the existing iterative weight prediction pattern with parallel weight control, which significantly improves the training and inference efficiency of the model. Extensive experiments on the MS-COCO and Flickr30K datasets demonstrate that PGRR consistently outperforms state-of-the-art methods for image-text matching.
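
The idea of weighting all local features in one pass can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the projection-free self-attention, the scoring vector `w`, the sigmoid weighting, the Bernoulli keep/drop policy, and the constant reward are all assumptions made for illustration. It only shows how self-attention-refined local features could yield one weight per region (word) in parallel, how those weights pool into a global feature, and what a REINFORCE-style gradient for a discrete policy looks like.

```python
import numpy as np

def self_attention(local_feats):
    """Scaled dot-product self-attention with no learned projections (a simplification)."""
    d = local_feats.shape[-1]
    scores = local_feats @ local_feats.T / np.sqrt(d)      # (n, n) pairwise relations
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)         # row-wise softmax
    return attn @ local_feats                              # (n, d) refined features

def parallel_weights(local_feats, w):
    """Weight all n local features at once; w is a hypothetical (d,) scoring vector."""
    refined = self_attention(local_feats)
    logits = refined @ w                                   # (n,)
    return 1.0 / (1.0 + np.exp(-logits))                   # sigmoid, values in (0, 1)

def global_feature(local_feats, weights):
    """Weighted average of local features, giving one (d,) global vector."""
    return (weights[:, None] * local_feats).sum(axis=0) / weights.sum()

def reinforce_grad_logits(probs, actions, reward):
    """Gradient of reward * log-prob of Bernoulli actions with respect to the logits."""
    return reward * (actions - probs)

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))        # 5 hypothetical region (word) features, dim 8
w = rng.normal(size=8)                 # hypothetical scoring parameters
probs = parallel_weights(feats, w)     # one weight per local feature, no iteration
g = global_feature(feats, probs)
actions = (rng.random(5) < probs).astype(float)   # discrete keep/drop sample
grad = reinforce_grad_logits(probs, actions, reward=1.0)  # reward value is a placeholder
```

In this sketch no weight depends on a previously predicted weight, so all of them come out of a single matrix computation, which is the property the abstract contrasts with iterative weight prediction.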