Multi-Modal Interactive Agent Layer for Few-Shot Universal Cross-Domain Retrieval and Beyond

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Few-Shot Universal Cross Domain Retrieval, Vision-Language Models, Cross-Domain Interaction, Agent Layer
TL;DR: We introduce the few-shot universal cross-domain retrieval setup and propose a novel PEFT method, MAIL, tailored for both this task and few-shot classification.
Abstract: This paper addresses, for the first time, the challenge of few-shot universal cross-domain retrieval (FS-UCDR): enabling machines trained with limited data to generalize to novel retrieval scenarios in which queries come from entirely unknown domains and categories. We first formally define the FS-UCDR task and then propose the Multi-Modal Interactive Agent Layer (MAIL), which enhances cross-modal interaction in vision-language models (VLMs) by aligning the parameter updates of target layer pairs across modalities. Specifically, MAIL freezes a selected target layer pair and introduces a trainable agent layer pair to approximate localized parameter updates. A bridge function then couples the agent layer pair, enabling gradient communication across modalities and thereby aligning the updates. MAIL offers four key advantages: 1) its cross-modal interaction mechanism improves knowledge acquisition from limited data, making it highly effective in low-data scenarios; 2) at inference time, MAIL integrates seamlessly into the VLM via reparameterization, preserving inference complexity; 3) extensive experiments validate the superiority of MAIL, which achieves substantial performance gains over data-efficient UCDR methods while requiring significantly fewer training samples; 4) beyond UCDR, MAIL also performs competitively on few-shot classification, underscoring its strong generalization ability.
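The mechanism described in the abstract can be sketched numerically. The following is a minimal illustration, not the authors' implementation: the low-rank parameterization of the agent layers and the additive form of the bridge function are assumptions for concreteness (the abstract only states that a trainable agent layer pair approximates localized updates and a bridge couples the two agents), and all variable names (`W_v`, `A_t`, `bridge`, etc.) are hypothetical. It shows the one property the abstract does state precisely: the learned update can be folded back into the frozen weights, so inference cost is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hidden size and agent rank (illustrative values)

# Frozen target layer pair: one weight matrix per modality.
W_v = rng.standard_normal((d, d))   # vision branch (frozen)
W_t = rng.standard_normal((d, d))   # text branch (frozen)

# Trainable agent layer pair, written here as low-rank factors
# (a common PEFT parameterization; assumed, not specified by the paper).
A_v, B_v = 0.01 * rng.standard_normal((d, r)), rng.standard_normal((r, d))
A_t, B_t = 0.01 * rng.standard_normal((d, r)), rng.standard_normal((r, d))

def bridge(dW_other, lam=0.5):
    """Hypothetical bridge: mixes the other modality's agent update into
    this modality's update, so a loss on one branch sends gradients
    through the other branch's agent parameters as well."""
    return lam * dW_other

# Localized parameter updates produced by the coupled agent pair.
dW_v = A_v @ B_v + bridge(A_t @ B_t)
dW_t = A_t @ B_t + bridge(A_v @ B_v)

x = rng.standard_normal(d)

# Training-time forward on the vision branch: frozen path + agent path.
y_train = x @ W_v + x @ dW_v

# Inference-time reparameterization: fold the update into the frozen
# weights, leaving the VLM's architecture and inference cost unchanged.
y_infer = x @ (W_v + dW_v)

assert np.allclose(y_train, y_infer)
```

Because the bridge makes each modality's effective update depend on both agents, backpropagating a retrieval loss through either branch updates both agent layers, which is the cross-modal update alignment the abstract describes.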
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 2146