Overcoming the Pitfalls of Vision-Language Model for Image-Text Retrieval

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: This work tackles the persistent challenge of image-text retrieval, a key problem at the intersection of computer vision and natural language processing. Despite significant advances enabled by large-scale Contrastive Language-Image Pretraining (CLIP) models, we find that existing methods fall short in bridging the fine-grained semantic gap between visual and textual representations, particularly in capturing the nuanced interplay between local visual details and textual descriptions. To address these challenges, we propose a general framework called Local and Generative-driven Modality Gap Correction (LG-MGC), which is devoted to simultaneously enhancing representation learning and alleviating the modality gap in cross-modal retrieval. Specifically, the proposed model consists of two main components: a local-driven semantic completion module, which complements global features with the specific local context information that traditional models overlook, and a generative-driven semantic translation module, which leverages generated features as a bridge to mitigate the modality gap. This framework not only addresses the granularity of semantic correspondence and improves the performance of existing methods without requiring additional trainable parameters, but is also plug-and-play, allowing easy integration into existing retrieval models without altering their architectures. Extensive qualitative and quantitative experiments demonstrate the effectiveness of LG-MGC, which achieves consistent state-of-the-art performance over strong baselines. The code is included in the supplementary material.
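To make the data flow of the two modules concrete, the snippet below gives a minimal, parameter-free sketch of how such a plug-and-play correction could post-process frozen CLIP features. The function names (local_semantic_completion, generative_semantic_translation), the temperature tau, the residual fusion, and the generator callable are all assumptions for illustration; they are not the authors' implementation, which is provided in the supplementary material.

```python
# Illustrative sketch only (not the paper's released code): a parameter-free view of
# how the two LG-MGC components could post-process frozen CLIP features.
import torch
import torch.nn.functional as F


def local_semantic_completion(global_feat: torch.Tensor,
                              local_feats: torch.Tensor,
                              tau: float = 0.07) -> torch.Tensor:
    """Complement a global embedding with local (patch/token) context it overlooks.

    global_feat: (B, D) pooled CLIP feature; local_feats: (B, N, D) token features.
    Parameter-free attention with the global feature as query, so no additional
    trainable parameters are introduced.
    """
    scores = (local_feats @ global_feat.unsqueeze(-1)).squeeze(-1) / tau   # (B, N)
    weights = torch.softmax(scores, dim=-1)
    local_context = (weights.unsqueeze(-1) * local_feats).sum(dim=1)       # (B, D)
    return F.normalize(global_feat + local_context, dim=-1)


def generative_semantic_translation(feat: torch.Tensor, generator) -> torch.Tensor:
    """Use a generated feature as a bridge toward the other modality.

    `generator` stands in for a frozen pretrained generative model (e.g. a diffusion
    model) that maps `feat` toward the other modality's embedding space.
    """
    bridge = generator(feat)                     # (B, D) generated bridge feature
    return F.normalize(feat + bridge, dim=-1)


# Retrieval then scores the gap-corrected embeddings with plain cosine similarity:
#   img = generative_semantic_translation(local_semantic_completion(img_g, img_l), gen_i2t)
#   txt = generative_semantic_translation(local_semantic_completion(txt_g, txt_l), gen_t2i)
#   sim = img @ txt.t()
```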
Relevance To Conference: This work advances multimedia/multimodal processing by introducing the Local and Generative-driven Modality Gap Correction (LG-MGC) approach for enhanced image-text retrieval. This methodology addresses the longstanding challenge of bridging the semantic gap between visual and textual modalities, a critical issue in multimedia applications. By incorporating local visual details into global representations and leveraging diffusion models to mitigate the modality gap, the approach significantly improves the semantic alignment between images and text. This contribution is pivotal for applications requiring accurate matching of visual content with descriptive text, such as digital archives, search engines, and content management systems. The integration of diffusion models showcases an innovative use of these models in multimodal contexts, opening new avenues for research and application in multimedia processing. Overall, this work enhances the performance of image-text retrieval systems and contributes to the broader field of multimodal learning, highlighting the potential for more sophisticated and accurate multimedia content analysis and generation.
Supplementary Material: zip
Primary Subject Area: [Engagement] Multimedia Search and Recommendation
Secondary Subject Area: [Content] Vision and Language
Submission Number: 1163