VEDJE: Video-Efficient Discriminative Joint Encoder for Scalable Video-Text Retrieval

Published: 01 Jun 2026, Last Modified: 07 Jun 2026, Venue: AdaptFM Poster, Readers: Everyone, License: CC BY 4.0
Keywords: Resource-adaptive foundation model inference, Efficient multimodal retrieval, Cached joint reranking, Video-text retrieval, Quality-resource tradeoffs
TL;DR: VEDJE reframes scalable video-text reranking as cache design: store compact ordered frame-local video tokens offline, then use a 33M cross-encoder with a first-stage prior to improve retrieval without query-time visual encoding.
Abstract: We recast scalable second-stage video-text retrieval as a *cache-design* problem and present VEDJE, a cached joint reranker for video. The difficulty is that pairwise verification needs temporal evidence, but rerankers either re-encode frames per query, which is too slow at index scale, or compress each video into a single global summary, which discards the temporal cues verification depends on. VEDJE keeps the cache *video-like*: a frozen visual backbone runs once per video and writes an ordered, frame-local cache that a 33M cross-encoder reads at query time, supported by a residual first-stage score prior and a training-only future-delta loss $\mathcal{L}_{\mathrm{delta}}$ that biases tight caches toward frame-to-frame change. On MSR-VTT, MSVD, DiDeMo, and ActivityNet, VEDJE lifts R@1 over the matched first-stage retriever in both retrieval directions on all four datasets, reaching T2V R@1 of 56.5 on MSR-VTT and 56.8 on MSVD with the VideoCLIP-backed variant. VEDJE thereby keeps pairwise verification on the candidate path while moving its visual cost entirely to indexing time, where it is paid once per video and amortized across all future queries.
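The query-time path described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the function names (`build_frame_local_cache`, `toy_joint_score`, `rerank`), the patch-group mean pooling used to form frame-local tokens, the dot-product stand-in for the 33M cross-encoder, and the additive form of the residual first-stage prior (final score = weighted first-stage score + joint score). The training-only loss $\mathcal{L}_{\mathrm{delta}}$ is not sketched because the abstract does not give its form.

```python
import numpy as np

def build_frame_local_cache(patch_features, tokens_per_frame=2):
    """Offline step (hypothetical pooling): compress each frame's patch
    features into a few tokens while keeping the frame order intact.

    patch_features: array of shape (num_frames, num_patches, dim), assumed
    to come from a frozen visual backbone that runs once per video.
    Returns an array of shape (num_frames, tokens_per_frame, dim).
    """
    num_frames, num_patches, dim = patch_features.shape
    # Split the patch axis into contiguous groups and mean-pool each group.
    groups = np.array_split(np.arange(num_patches), tokens_per_frame)
    pooled = [patch_features[:, g, :].mean(axis=1) for g in groups]
    return np.stack(pooled, axis=1)

def toy_joint_score(query_vec, cached_tokens):
    """Stand-in for the cross-encoder (assumption): score a query against
    the cached tokens by its best-matching token."""
    flat = cached_tokens.reshape(-1, cached_tokens.shape[-1])
    return float((flat @ query_vec).max())

def rerank(query_vec, candidate_ids, first_stage_scores, cache_store,
           joint_score=toy_joint_score, prior_weight=1.0):
    """Query-time step: read cached tokens only (no visual encoding) and
    add the joint score to the first-stage score as a residual
    (additive form assumed)."""
    final = [
        prior_weight * prior + joint_score(query_vec, cache_store[vid])
        for vid, prior in zip(candidate_ids, first_stage_scores)
    ]
    order = np.argsort(final)[::-1]  # highest final score first
    return [(candidate_ids[i], final[i]) for i in order]

# Usage: two cached videos; the first stage slightly prefers "a",
# but only the cached tokens of "b" match the query.
store = {
    "a": np.zeros((3, 2, 2)),
    "b": np.tile(np.array([1.0, 0.0]), (3, 2, 1)),
}
ranking = rerank(np.array([1.0, 0.0]), ["a", "b"], [0.6, 0.5], store)
```

In this toy setup the reranker only touches the precomputed cache at query time, which mirrors the claim that the visual cost is paid once at indexing time; a real system would replace `toy_joint_score` with the learned cross-encoder.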
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 139