Beyond General Alignment: Fine-Grained Entity-Centric Image-Text Matching with Multimodal Attentive Experts

Published: 01 Jan 2025 · Last Modified: 12 Nov 2025 · SIGIR 2025 · CC BY-SA 4.0
Abstract: Recent progress in aligning images with texts has achieved remarkable results; however, existing models tend to serve general queries and often fall short when handling detailed query requirements. In this paper, we work towards Entity-centric Image-Text Matching (EITM), a finer-grained image-text matching task that aligns texts and images centered around specific entities. The main challenge in EITM lies in bridging the substantial semantic gap between entity-related information in texts and images, which is more pronounced than in general image-text matching. To address this challenge, we adopt CLIP as our foundation model and devise a Multimodal Attentive Experts (MMAE)-based contrastive learning scheme that adapts CLIP into an expert for the EITM problem. The core of our multimodal attentive experts learning is to generate explanation texts with Large Language Models (LLMs) as bridging clues. Specifically, we first employ off-the-shelf LLMs to generate explanatory text; this text, along with the original image and text, is then fed into our Multimodal Attentive Experts module to narrow the semantic gap within a unified semantic space. On top of the enriched feature representations produced by MMAE, we further develop an effective Gated Integrative Image-Text Matching (GI-ITM) strategy. GI-ITM uses an adaptive gating mechanism to combine features from MMAE and then applies image-text matching constraints to improve alignment precision. Our method is extensively evaluated on three social media news benchmarks: N24News, VisualNews, and GoodNews. The experimental results demonstrate that our approach significantly outperforms competing methods. Our code is available at: https://github.com/wangyxxjtu/ETE.
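To make the gated integration idea concrete, below is a minimal, hypothetical sketch of how an adaptive gate could combine a text embedding with an LLM-generated explanation embedding before applying a contrastive image-text matching loss. The module names, dimensions, and gating formulation here are illustrative assumptions and not the authors' implementation; the actual code is available at the repository linked above.

```python
# Hypothetical sketch: adaptive gate over text / explanation features,
# followed by a symmetric contrastive image-text matching loss.
# All names and dimensions are assumptions, not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedFusion(nn.Module):
    """Adaptively mixes a text embedding with an LLM-explanation embedding."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, text_feat: torch.Tensor, expl_feat: torch.Tensor) -> torch.Tensor:
        # Per-dimension gate in (0, 1), conditioned on both inputs.
        g = self.gate(torch.cat([text_feat, expl_feat], dim=-1))
        fused = g * text_feat + (1.0 - g) * expl_feat  # convex combination
        return F.normalize(fused, dim=-1)


def contrastive_itm_loss(img_feat: torch.Tensor, txt_feat: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss over an in-batch similarity matrix."""
    img_feat = F.normalize(img_feat, dim=-1)
    logits = img_feat @ txt_feat.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Toy usage with random CLIP-like features (batch of 8, dim 512).
    B, D = 8, 512
    image_feat = torch.randn(B, D)
    text_feat = torch.randn(B, D)
    expl_feat = torch.randn(B, D)  # stand-in for the LLM explanation embedding
    fusion = GatedFusion(dim=D)
    loss = contrastive_itm_loss(image_feat, fusion(text_feat, expl_feat))
    print(loss.item())
```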