MCA: Modality Composition Awareness for Robust Composed Multimodal Retrieval

MCA: Modality Composition Awareness for Robust Composed Multimodal Retrieval

ACL ARR 2026 January Submission4079 Authors

05 Jan 2026 (modified: 20 Mar 2026)ACL ARR 2026 January SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Multimodal Retrieval, Modality Shortcut, MLLMs, Modality Composition

Abstract: Multimodal retrieval, which seeks to retrieve relevant content across modalities such as text or image, supports applications from AI search to contents production. Despite the success of separate-encoder approaches like CLIP aligning modality-specific embeddings with contrastive learning, recent multimodal large language models (MLLMs) enable a unified encoder that directly processes composed inputs. While flexible and advanced, we identify that unified encoders trained with conventional contrastive learning are prone to learn modality shortcut, leading to poor robustness under distribution shifts. We propose a modality composition awareness framework to mitigate this issue. Concretely, it consists of a preference loss enforces multimodal embeddings to outperform their unimodal counterparts, and a composition regularization objective aligns multimodal embeddings with prototypes composed from its unimodal parts. These objectives explicitly model structural relationships between the composed representation and its unimodal counterparts. Experiments on various benchmarks show gains in out-of-distribution retrieval, highlighting modality composition awareness as a effective principle for robust composed multimodal retrieval when utilizing MLLMs as the unified encoder.

Paper Type: Long

Research Area: Information Extraction and Retrieval

Research Area Keywords: Multimodal retrieval, embedding model

Contribution Types: Model analysis & interpretability, NLP engineering experiment

Languages Studied: English

Submission Number: 4079

Loading