Abstract: Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embedding model adept at clustering images is equally good at retrieving relevant images given a piece of text. We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image-text embedding models across the broadest spectrum to date. MIEB spans 38 languages across 130 individual tasks, which we group into 8 high-level categories.
We benchmark 50 models, finding that no single method dominates across all task categories. We reveal hidden capabilities of advanced vision models, such as their accurate visual representation of text, as well as their still-limited performance on interleaved encodings and on matching images and texts in the presence of confounders. We also show that the performance of vision encoders on MIEB correlates highly with their performance when used in multimodal large language models.