Abstract: Screening out similar but non-target images in text-based image retrieval is crucial for accurately pinpointing the user's desired images. However, conventional methods mainly focus on enhancing text-image matching performance and often fail to identify the images that exactly match the retrieval intention because of limited query quality. User-provided queries frequently lack sufficient information to screen out similar but non-target images, especially when the target database (DB) contains numerous similar images. Therefore, a novel approach is needed to elicit valuable information from users for effective screening. In this paper, we propose a DB question generation (DQG) model to enhance exact cross-modal image retrieval performance. Our DQG model learns to generate effective questions that precisely screen out similar but non-target images by using DB content information. By answering only the questions generated by our model, users can reach their desired images even within DBs containing similar content. Experimental results on publicly available datasets show that our proposed approach can significantly improve exact cross-modal image retrieval performance. Code is available in the supplemental materials and will be made publicly available.
Primary Subject Area: [Engagement] Multimedia Search and Recommendation
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: Our work contributes to the multimedia/multimodal processing field from the perspective of improving text-based image retrieval. Our text-based image retrieval task handles both vision and language information, and their combination is an active topic in multimedia/multimodal processing. In addition, we improve retrieval performance by effectively utilizing visual question answering and image captioning models, and this utilization cultivates applications of multimodal generative models.
Supplementary Material: zip
Submission Number: 4139