Keywords: Multimodal Retrieval, Modality Shortcut, MLLMs, Modality Composition
Abstract: Multimodal retrieval, which seeks to retrieve relevant content across modalities such as text or image, supports applications from AI search to contents production. Despite the success of separate-encoder approaches like CLIP aligning modality-specific embeddings with contrastive learning, recent multimodal large language models (MLLMs) enable a unified encoder that directly processes composed inputs. While flexible and advanced, we identify that unified encoders trained with conventional contrastive learning are prone to learn modality shortcut, leading to poor robustness under distribution shifts. We propose a modality composition awareness framework to mitigate this issue. Concretely, it consists of a preference loss enforces multimodal embeddings to outperform their unimodal counterparts, and a composition regularization objective aligns multimodal embeddings with prototypes composed from its unimodal parts. These objectives explicitly model structural relationships between the composed representation and its unimodal counterparts. Experiments on various benchmarks show gains in out-of-distribution retrieval, highlighting modality composition awareness as a effective principle for robust composed multimodal retrieval when utilizing MLLMs as the unified encoder.
Paper Type: Long
Research Area: Information Extraction and Retrieval
Research Area Keywords: Multimodal retrieval, embedding model
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 4079
Loading