Keywords: VLM, Evaluation, Benchmark, dataset, OOD, generalization
Abstract: Cross-modal retrieval tasks, such as image-to-text and text-to-image, are crucial for evaluating vision-language models (VLMs). State-of-the-art VLMs like CLIP and BLIP-2 achieve impressive performance on benchmarks such as MSCOCO and Flickr30K. However, due to the high similarity between evaluation datasets (e.g., Flickr30K) and fine-tuning datasets (e.g., MSCOCO), these benchmarks are insufficient for assessing the out-of-distribution (OOD) generalization capabilities of VLMs. We introduce $\textbf{WIKIDO}$ (derived from $\textbf{Wiki}$pedia $\textbf{D}$iversity $\textbf{O}$bservatory), a new benchmark featuring 384K image-text pairs, alongside carefully curated, human-verified in-distribution (ID) and OOD test sets of size 3K each. Our evaluations show that BLIP-2 achieves a zero-shot recall at 1 (R@1) of 66\% on WIKIDO's OOD test set, compared to 81\% on MSCOCO and 95\% on Flickr30K. Fine-tuning on WIKIDO yields modest improvements, further demonstrating the benchmark's utility in testing OOD generalization. Our code and benchmark datasets will be released publicly.
Submission Number: 25
Loading