Re-M: Adapting Multi-Modal Large Language Models for Zero-Shot Cross-Modal Hybrid Retrieval and Reranking

ACL ARR 2026 January Submission 10661 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Multi-Modal Large Language Model, Zero-Shot Cross-Modal Retrieval, Hybrid Retrieval, Re-ranking Paradigm, Prompting Approach
Abstract: Multi-Modal Large Language Models (MLLMs) demonstrate remarkable capabilities in vision-language understanding. Leveraging MLLMs to extract features for modern retrieval pipelines, spanning sparse retrieval, dense retrieval, and re-ranking, is a promising direction that eliminates the need for expensive training data. In this paper, we investigate the feasibility of extracting high-quality sparse representations from MLLMs and propose a multi-perspective prompting method to enhance their representational expressivity. Furthermore, we identify a significant performance disparity between image-to-text and text-to-image tasks during the re-ranking phase, indicating that the two directions require distinct strategies. Building on these insights, we introduce Re-M, a two-stage zero-shot cross-modal retrieval framework that integrates sparse-dense hybrid retrieval with asymmetric re-ranking. In zero-shot settings, Re-M rivals or even surpasses supervised baselines.
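The paper body is not reproduced on this page, but the first stage of the pipeline described above follows the familiar hybrid-retrieval pattern of fusing sparse (lexical) and dense (embedding) scores before handing a short list to a re-ranker. The sketch below is a minimal illustration of that general pattern, not the authors' implementation: the function names, the min-max normalization, and the fusion weight alpha are all assumptions introduced here for clarity.

```python
import numpy as np

def hybrid_retrieve(sparse_scores, dense_scores, alpha=0.5, top_k=10):
    """Fuse sparse and dense scores for one query via a weighted sum
    and return the indices of the top-k candidates.

    alpha is an illustrative fusion weight, not a value from the paper.
    """
    def normalize(scores):
        # Min-max normalize so the two score scales are comparable.
        s = np.asarray(scores, dtype=float)
        lo, hi = s.min(), s.max()
        return (s - lo) / (hi - lo) if hi > lo else np.zeros_like(s)

    fused = alpha * normalize(sparse_scores) + (1.0 - alpha) * normalize(dense_scores)
    # First-stage candidate list; a second-stage re-ranker (e.g., an MLLM
    # scoring query-candidate pairs) would then reorder these top-k items.
    return np.argsort(fused)[::-1][:top_k]

# Toy usage: five candidate items scored against one query.
sparse = [3.1, 0.2, 1.7, 2.4, 0.9]       # e.g., lexical term-weight overlap
dense = [0.62, 0.15, 0.88, 0.40, 0.33]   # e.g., embedding cosine similarity
print(hybrid_retrieve(sparse, dense, top_k=3))
```

The asymmetric re-ranking the abstract describes would then apply different second-stage strategies to the image-to-text and text-to-image directions; that logic is specific to the paper and is not sketched here.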
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: image text matching
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English
Submission Number: 10661