CLaMR: Contextualized Late-Interaction for Multimodal Content Retrieval

ICLR 2026 Conference Submission 15879 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: multimodal retrieval, text-video retrieval, RAG
Abstract: Online video content is richly multimodal: a single video might blend vision, speech, ambient audio, and on-screen text. Conventional retrieval systems typically treat these modalities as independent retrieval sources, which can lead to noisy and subpar results. In this work, we explore multimodal video content retrieval, where relevance can be scored from a single modality or jointly across multiple modalities. Consequently, an effective retriever must dynamically determine which modality (or set of modalities) best addresses a given query. We introduce CLaMR, a multimodal, late-interaction retriever that jointly indexes four modalities: video frames, transcribed speech, on-screen text, and metadata. CLaMR jointly encodes all modalities within a unified multimodal backbone for improved contextualization and is trained to enhance dynamic modality selection via two key innovations. First, to overcome the lack of suitable training data, we introduce MultiVent 2.0++, a large-scale synthetic dataset built on MultiVent 2.0 (a collection of event-centric videos in various languages paired with English queries) with modality-targeted queries designed to teach modality selection. Second, we propose a modality-aware contrastive loss that combines a standard contrastive objective with an objective for learning correct modality usage. On the test sets of MultiVent 2.0++ and MSRVTT, we observe that conventional aggregation strategies, such as averaging per-modality similarities from baseline retrievers, often degrade performance by introducing noise from irrelevant modalities. In contrast, CLaMR consistently outperforms existing retrievers: on MultiVent 2.0++, CLaMR improves nDCG@10 by 25.6 points over the best-performing single-modality retriever and by 35.4 points over the best-performing multi-modality retriever. We illustrate the downstream utility of CLaMR with experiments on long-video QA, where it improves performance by 3.50% over LanguageBind on Video-MME and by 1.42% over dense frame sampling on LongVideoBench.
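The abstract names late interaction over jointly encoded modalities as the core scoring mechanism. The sketch below is a minimal, generic illustration of ColBERT-style MaxSim late-interaction scoring over a document token stream tagged with modality IDs; it is not the authors' implementation, and all names (`score_late_interaction`, `doc_modality_ids`, etc.) are illustrative assumptions.

```python
# Generic late-interaction (MaxSim) scoring sketch, assuming L2-normalized
# token embeddings and an integer modality tag per document token.
import torch


def score_late_interaction(query_tokens: torch.Tensor,
                           doc_tokens: torch.Tensor,
                           doc_modality_ids: torch.Tensor):
    """
    query_tokens:     (Lq, d) normalized query token embeddings
    doc_tokens:       (Ld, d) normalized document token embeddings drawn from
                      all modalities (frames, speech transcript, OCR, metadata)
    doc_modality_ids: (Ld,)   integer modality tag for each document token
    Returns the MaxSim relevance score and a per-modality count of how often
    each modality supplied the best-matching document token.
    """
    sim = query_tokens @ doc_tokens.T            # (Lq, Ld) token-level similarities
    max_sim, argmax = sim.max(dim=1)             # best document token per query token
    score = max_sim.sum()                        # late-interaction relevance score
    modality_usage = torch.bincount(
        doc_modality_ids[argmax],
        minlength=int(doc_modality_ids.max()) + 1)
    return score, modality_usage


# Usage with random embeddings (4 hypothetical modalities):
q = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(200, 128), dim=-1)
mods = torch.randint(0, 4, (200,))
score, usage = score_late_interaction(q, d, mods)
```

The per-modality usage counts are included only to make the "dynamic modality selection" idea concrete: under this scoring scheme, one can inspect which modality's tokens the query actually matched against.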
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15879