Keywords: Vision-Language Models, Multimodal Information Retrieval, Representation Learning
TL;DR: We present LEMUR, a VLM-based framework for fine-grained retrieval that incorporates a Region-Aware Encoder, along with the FGMB benchmark. LEMUR achieves up to a 20% improvement in fine-grained retrieval.
Abstract: Fine-grained multimodal retrieval is crucial for many real-world applications. For example, e-commerce product search requires retrieving the product with the most relevant image and description based on specific regions of the query image. However, existing CLIP-based or VLM-based retrieval methods primarily focus on image-level tasks and struggle with region-level applications. In this work, we present LEMUR, a VLM-based fine-grained retrieval framework that enhances regional representations without compromising image-level retrieval performance. At its core, LEMUR incorporates a Region-Aware Encoder that extracts detailed features from query regions to complement the global image representation. To further enhance fine-grained retrieval capability, we integrate detailed localized captioning and regional contrastive learning tasks, which strengthen the model's fine-grained understanding and representation. In addition, considering the limitations of existing benchmarks, such as the absence of region-level contrastive pairs and the limited diversity of evaluation tasks, we introduce the FGMB benchmark, which comprises 225k contrastive pairs covering two meta-tasks and four multimodal retrieval scenarios. Extensive experiments validate the effectiveness of our approach. LEMUR generally outperforms strong baselines in zero-shot settings. Further training with regional contrastive learning yields an average improvement of 20% in fine-grained retrieval performance, while achieving comparable or better results on image-level retrieval tasks. The code and data will be released to facilitate future research.
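To make the abstract's description concrete, the following is a minimal, hypothetical sketch (not the authors' released code) of how a region-aware encoder might fuse pooled region features with a global image embedding and be trained with an InfoNCE-style regional contrastive loss over in-batch negatives. All module names, dimensions, and the pooling/fusion choices are assumptions for illustration only.

```python
# Hypothetical sketch of region-aware fusion + regional contrastive learning.
# Not the LEMUR implementation; names and dimensions are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionAwareEncoder(nn.Module):
    """Pools features inside a query region and fuses them with the
    global image embedding (assumed design, for illustration)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.region_proj = nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, global_emb: torch.Tensor, region_feats: torch.Tensor) -> torch.Tensor:
        # global_emb: (B, D) image-level embedding from a base VLM encoder
        # region_feats: (B, N, D) patch features falling inside the query region
        region_emb = self.region_proj(region_feats.mean(dim=1))          # (B, D) pooled region feature
        fused = self.fuse(torch.cat([global_emb, region_emb], dim=-1))   # combine global + regional cues
        return F.normalize(fused, dim=-1)


def regional_contrastive_loss(query_emb: torch.Tensor,
                              target_emb: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between region-aware query embeddings and target
    (image or text) embeddings, using in-batch negatives."""
    logits = query_emb @ target_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


if __name__ == "__main__":
    B, N, D = 8, 16, 512
    encoder = RegionAwareEncoder(dim=D)
    fused = encoder(torch.randn(B, D), torch.randn(B, N, D))
    targets = F.normalize(torch.randn(B, D), dim=-1)
    print(regional_contrastive_loss(fused, targets).item())
```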
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1273