MIEB: Massive Image Embedding Benchmark

Published: 23 Sept 2025, Last Modified: 23 Sept 2025 · OpenReview Archive Direct Upload · CC BY-SA 4.0
Abstract: Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embedding model adept at clustering images is equally good at retrieving relevant images given a piece of text. We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image-text embedding models across the broadest spectrum to date. MIEB spans 38 languages across 130 individual tasks, which we group into 8 high-level categories. We benchmark 50 models, finding that no single method dominates across all task categories. We reveal hidden capabilities in advanced vision models, such as their accurate visual representation of texts, as well as their still-limited capabilities in interleaved encodings and in matching images and texts in the presence of confounders. We also show that the performance of vision encoders on MIEB correlates highly with their performance when used in multimodal large language models.