Keywords: Vision-Language Models, Multimodal Information Retrieval, Representation Learning
TL;DR: We present LEMUR, a VLM-based framework for fine-grained retrieval that incorporates a Region-Aware Encoder, along with the FGMB benchmark. LEMUR achieves up to a 20% improvement in fine-grained retrieval.
Abstract: Fine-grained multimodal retrieval is crucial for many real-world applications. For example, e-commerce product search requires retrieving the product with the most relevant image and description based on specific regions of the query image. However, existing CLIP-based or VLM-based retrieval methods primarily focus on image-level tasks and struggle with region-level applications. In this work, we present LEMUR, a VLM-based fine-grained retrieval framework that enhances regional representations without compromising image-level retrieval performance. At its core, LEMUR incorporates a Region-Aware Encoder that extracts detailed features from query regions to complement the global image representation. To further enhance fine-grained retrieval capability, we integrate detailed localized captioning and regional contrastive learning tasks, which strengthen the model's fine-grained understanding and representation. In addition, considering the limitations of existing benchmarks, such as the absence of region-level contrastive pairs and the limited diversity of evaluation tasks, we introduce the FGMB benchmark, which comprises 225k contrastive pairs covering two meta-tasks and four multimodal retrieval scenarios. Extensive experiments validate the effectiveness of our approach. LEMUR generally outperforms strong baselines in zero-shot settings. Further training with regional contrastive learning yields an average improvement of 20% in fine-grained retrieval performance, while achieving comparable or better results on image-level retrieval tasks. The code and data will be released to facilitate future research.
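To make the abstract's description concrete, the following is a minimal, hypothetical sketch (not the authors' released code) of how a region-aware encoder might fuse pooled region features with a global image embedding and be trained with an InfoNCE-style regional contrastive loss over in-batch negatives. All module names, dimensions, and the pooling/fusion choices are assumptions for illustration only.

```python
# Hypothetical sketch of region-aware fusion + regional contrastive learning.
# Not the LEMUR implementation; names and dimensions are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionAwareEncoder(nn.Module):
    """Pools features inside a query region and fuses them with the
    global image embedding (assumed design, for illustration)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.region_proj = nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, global_emb: torch.Tensor, region_feats: torch.Tensor) -> torch.Tensor:
        # global_emb: (B, D) image-level embedding from a base VLM encoder
        # region_feats: (B, N, D) patch features falling inside the query region
        region_emb = self.region_proj(region_feats.mean(dim=1))          # (B, D) pooled region feature
        fused = self.fuse(torch.cat([global_emb, region_emb], dim=-1))   # combine global + regional cues
        return F.normalize(fused, dim=-1)


def regional_contrastive_loss(query_emb: torch.Tensor,
                              target_emb: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between region-aware query embeddings and target
    (image or text) embeddings, using in-batch negatives."""
    logits = query_emb @ target_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


if __name__ == "__main__":
    B, N, D = 8, 16, 512
    encoder = RegionAwareEncoder(dim=D)
    fused = encoder(torch.randn(B, D), torch.randn(B, N, D))
    targets = F.normalize(torch.randn(B, D), dim=-1)
    print(regional_contrastive_loss(fused, targets).item())
```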
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1273