Keywords: Multimodal Large Language Model, Multimodal Retrieval
TL;DR: We present FreeRet, a plug-and-play framework that turns any off-the-shelf MLLM into a powerful multimodal retriever
Abstract: Multimodal large language models (MLLMs) are emerging as versatile foundations for mixed-modality retrieval.
Yet converting them into contrastive encoders for retrieval typically requires heavy post-hoc training.
This work asks: \textit{Can off-the-shelf MLLMs serve as powerful retrievers without additional training?}
We present \textbf{FreeRet}, a plug-and-play framework that turns any MLLM into a two-stage retriever.
FreeRet first derives semantically grounded embeddings directly from the model for fast candidate search, and then exploits its reasoning ability for precise reranking.
The framework contributes three advances: bypassing lexical alignment layers to obtain semantically faithful embeddings, conditioning representation generation on explicit priors, and mitigating the framing effect in reranking via neutral choice framing.
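The two-stage retrieve-then-rerank pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: `embed` is a toy deterministic stand-in for the MLLM-derived embedding, and `score_fn` is a hypothetical hook where FreeRet would prompt the MLLM with neutral choice framing.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stage-1 stand-in: a deterministic pseudo-embedding.
    (In FreeRet, embeddings are derived from the MLLM itself.)"""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def retrieve(query: str, candidates: list[str], k: int = 3) -> list[str]:
    """Stage 1: fast candidate search by cosine similarity over embeddings."""
    q = embed(query)
    scored = sorted(((float(q @ embed(c)), c) for c in candidates), reverse=True)
    return [c for _, c in scored[:k]]

def rerank(query: str, shortlist: list[str], score_fn) -> list[str]:
    """Stage 2: precise reranking of the shortlist.
    `score_fn(query, candidate)` is a hypothetical scoring hook; FreeRet
    would obtain this score from the MLLM's reasoning over the pair."""
    return sorted(shortlist, key=lambda c: score_fn(query, c), reverse=True)
```

A usage example with a simple word-overlap scorer standing in for the MLLM judge:

```python
docs = ["a photo of a cat", "a photo of a dog", "stock chart"]
shortlist = retrieve("cat picture", docs, k=2)
overlap = lambda q, c: len(set(q.split()) & set(c.split()))
final = rerank("cat picture", shortlist, score_fn=overlap)
```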
On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, FreeRet substantially outperforms models trained on millions of pairs.
Beyond benchmarks, FreeRet is model-agnostic and scales seamlessly across MLLM families and sizes, preserves their generative abilities, supports arbitrary modality combinations, and unifies retrieval, reranking, and generation into end-to-end RAG within a single model.
Our findings demonstrate that pretrained MLLMs, when carefully harnessed, can serve as strong retrieval engines without training, closing a critical gap in their role as generalists.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5781