What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission · Readers: Everyone
Abstract: Vision-language (VL) models pretrained on colossal image-text datasets have attained broad VL competence that is difficult to evaluate. A common belief is that a small number of VL skills underlie the variety of VL tests. In this paper, we perform a large-scale transfer learning experiment aimed at discovering latent VL skills from data. We reveal interesting characteristics that have important implications for test suite design. First, generation tasks suffer from a length bias, suggesting that benchmarks should balance tasks with different output lengths. Second, we demonstrate that factor analysis successfully identifies reasonable yet surprising VL skill factors, suggesting that VL test suites should consider similar analyses. Finally, we present a new dataset, OLIVE, which simulates user instructions in the wild and poses a unique challenge dissimilar to all other datasets we tested. Our findings contribute to the design of balanced and broad-coverage vision-language evaluation methods.
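As a rough illustration of the abstract's factor-analysis idea, the sketch below fits a small number of latent factors to a model-by-task performance matrix. It is a minimal, hypothetical example: the task names, score values, and choice of scikit-learn's FactorAnalysis are assumptions for illustration, not the paper's actual data or implementation.

```python
# Hypothetical sketch: discover latent "skill" factors from a model-by-task
# transfer-performance matrix. Scores and task names below are placeholders.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Rows: fine-tuned model runs; columns: VL evaluation tasks (illustrative names).
tasks = ["VQA", "Captioning", "Retrieval", "Grounding", "OCR", "Reasoning"]
scores = rng.uniform(0.2, 0.9, size=(40, len(tasks)))  # stand-in for real transfer results

# Standardize each task's scores, then fit a small number of latent factors.
X = StandardScaler().fit_transform(scores)
fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(X)

# Loadings indicate how strongly each task draws on each latent factor;
# tasks with similar loading patterns would be interpreted as sharing a skill.
for task, loading in zip(tasks, fa.components_.T):
    print(f"{task:12s} factor loadings: {np.round(loading, 2)}")
```

With real transfer-learning results in place of the random matrix, inspecting the loading patterns is one way to group tasks by shared latent skills, in the spirit of the analysis the abstract describes.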
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: Data resources, Data analysis
Languages Studied: English