Keywords: Multimodal Retrieval, Vision–Language Models, Joint Encoding, Efficient Re-ranking, Token Compression
TL;DR: We propose EDJE, an efficient vision–language joint encoder with token compression that enables fast multimodal re-ranking, achieving up to 53× higher throughput while matching the accuracy of prior joint encoders.
Abstract: Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Yet, unlike text retrieval, where joint-encoder rerankers are standard, comparable vision–language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, preventing practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image–text pairs per second while requiring 49 kB of disk storage per image, matching prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval.
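To make the described pipeline concrete, below is a minimal sketch of an attention-based token-compression adapter, assuming a Perceiver/Q-Former-style design in which a small set of learned query tokens cross-attends to the precomputed vision tokens and emits a compact set of visual tokens for the joint encoder. The class name `TokenCompressionAdapter`, the dimensions, and the number of queries are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a lightweight attention-based compression adapter:
# learned queries cross-attend to offline-extracted vision tokens and produce
# a small, fixed-size set of visual tokens for online joint encoding.
# Names and hyperparameters are illustrative; EDJE's actual design may differ.
import torch
import torch.nn as nn


class TokenCompressionAdapter(nn.Module):
    def __init__(self, dim: int = 768, num_queries: int = 16, num_heads: int = 12):
        super().__init__()
        # The number of learned queries fixes how many compressed tokens are produced.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_patches, dim), precomputed offline.
        b = vision_tokens.size(0)
        q = self.norm_q(self.queries).unsqueeze(0).expand(b, -1, -1)
        kv = self.norm_kv(vision_tokens)
        compressed, _ = self.cross_attn(q, kv, kv)
        compressed = compressed + self.mlp(compressed)
        return compressed  # (batch, num_queries, dim)


if __name__ == "__main__":
    # Example: compress 577 ViT patch tokens down to 16 tokens per image.
    adapter = TokenCompressionAdapter(dim=768, num_queries=16)
    vision_tokens = torch.randn(4, 577, 768)  # offline-extracted features
    print(adapter(vision_tokens).shape)       # torch.Size([4, 16, 768])
```

In this kind of design, only the compressed tokens need to be stored and fed to the joint encoder at query time, which is what would account for the small per-image storage footprint and high online throughput the abstract reports.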
Primary Area: foundation or frontier models, including LLMs
Submission Number: 3764