Re-M: Adapting Multi-Modal Large Language Models for Zero-Shot Cross-Modal Hybrid Retrieval and Reranking

ACL ARR 2026 January Submission 10661 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Multi-Modal Large Language Model, Zero-Shot Cross-Modal Retrieval, Hybrid Retrieval, Re-ranking Paradigm, Prompting Approach
Abstract: Multi-Modal Large Language Models (MLLMs) demonstrate remarkable capabilities in vision-language understanding. Leveraging MLLMs to extract features for modern retrieval pipelines, spanning sparse retrieval, dense retrieval, and re-ranking, is a promising direction that eliminates the need for expensive training data. In this paper, we investigate the feasibility of extracting high-quality sparse representations from MLLMs and propose a multi-perspective prompting method to enhance their representational expressivity. Furthermore, we identify a significant performance disparity between image-to-text and text-to-image tasks during the re-ranking phase, indicating that the two directions require distinct strategies. Building on these insights, we introduce Re-M, a two-stage zero-shot cross-modal retrieval framework that integrates sparse-dense hybrid retrieval with asymmetric re-ranking. In zero-shot settings, Re-M rivals or even surpasses supervised baselines.
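The paper body is not reproduced on this page, but the first stage of the pipeline described above follows the familiar hybrid-retrieval pattern of fusing sparse (lexical) and dense (embedding) scores before handing a short list to a re-ranker. The sketch below is a minimal illustration of that general pattern, not the authors' implementation: the function names, the min-max normalization, and the fusion weight alpha are all assumptions introduced here for clarity.

```python
import numpy as np

def hybrid_retrieve(sparse_scores, dense_scores, alpha=0.5, top_k=10):
    """Fuse sparse and dense scores for one query via a weighted sum
    and return the indices of the top-k candidates.

    alpha is an illustrative fusion weight, not a value from the paper.
    """
    def normalize(scores):
        # Min-max normalize so the two score scales are comparable.
        s = np.asarray(scores, dtype=float)
        lo, hi = s.min(), s.max()
        return (s - lo) / (hi - lo) if hi > lo else np.zeros_like(s)

    fused = alpha * normalize(sparse_scores) + (1.0 - alpha) * normalize(dense_scores)
    # First-stage candidate list; a second-stage re-ranker (e.g., an MLLM
    # scoring query-candidate pairs) would then reorder these top-k items.
    return np.argsort(fused)[::-1][:top_k]

# Toy usage: five candidate items scored against one query.
sparse = [3.1, 0.2, 1.7, 2.4, 0.9]       # e.g., lexical term-weight overlap
dense = [0.62, 0.15, 0.88, 0.40, 0.33]   # e.g., embedding cosine similarity
print(hybrid_retrieve(sparse, dense, top_k=3))
```

The asymmetric re-ranking the abstract describes would then apply different second-stage strategies to the image-to-text and text-to-image directions; that logic is specific to the paper and is not sketched here.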
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: image text matching
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English
Submission Number: 10661