WikiScenes with Descriptions: Aligning Paragraphs and Sentences with Images in Wikipedia Articles

Published: 2024, Last Modified: 21 Dec 2025, Venue: *SEM@NAACL 2024, License: CC BY-SA 4.0
Abstract: Research in Language & Vision rarely uses naturally occurring multimodal documents such as Wikipedia articles, since they feature complex image-text relations and implicit image-text alignments. In this paper, we present one of the first datasets with ground-truth annotations of image-text alignments in multi-paragraph, multi-image articles. The dataset can be used to study phenomena of visual language grounding in longer documents and to assess the retrieval capabilities of language models trained on, e.g., captioning data. Our analyses show that there are systematic linguistic differences between image captions and descriptive sentences from the article’s text, and that intra-document retrieval is a challenging task for state-of-the-art models in L&V (CLIP, ViLT, MCSE).