VALSE: A Task-Independent Benchmark for Vision and Language Models centered on Linguistic Phenomena


17 Aug 2021 (modified: 05 May 2023) | ACL ARR 2021 August Blind Submission | Readers: Everyone
Abstract: We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for specific visio-linguistic grounding capabilities. Currently, V&L models are evaluated on tasks such as visual question answering or visual reasoning, which do not address their fine-grained linguistic capabilities. VALSE addresses this gap by offering a suite of six tests targeting specific linguistic phenomena. Solving these tests requires models to ground these phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We build VALSE using methods that support the construction of reliable foils, and report results from evaluating five widely-used V&L models. Our experiments suggest that current models have considerable difficulty addressing most phenomena. Hence, we expect VALSE to serve as an important benchmark to measure future progress of pretrained V&L models from a linguistic perspective, complementing the canonical task-centred V&L evaluations.
