Revisiting Multi-Modal LLM Evaluation

Published: 06 Mar 2025, Last Modified: 07 Mar 2025 · ICLR 2025 Workshop on Data Problems (Poster) · CC BY 4.0
Keywords: Multi-modal large language models, De-biasing Evaluation, Visual Query Detection, Visual Question Answering, Fine-grained VQA
TL;DR: Evaluations on more reliable, fine-grained VQA datasets and tasks (TDIUC, TallyQA, DVQA, and VQDv1) address flaws in mainstream benchmarks (e.g., VQAv2) and reveal weaknesses in multi-modal large language models.
Abstract: With the advent of multi-modal large language models (MLLMs), datasets used for visual question answering (VQA) and referring expression comprehension have seen a resurgence. However, the most popular datasets used to evaluate MLLMs are some of the earliest ones created (e.g., VQAv2, GQA, and TextVQA), and they have many known problems, including extreme bias, spurious correlations, and an inability to permit fine-grained analysis. In this paper, we pioneer evaluating recent MLLMs (LLaVA-OneVision, MiniGemini, CogVLM, GPT-4V, and others) on datasets designed to address weaknesses in earlier ones. We assess three VQA datasets: 1) TDIUC, which permits fine-grained analysis across 12 question types; 2) TallyQA, which has simple and complex counting questions; and 3) DVQA, which requires optical character recognition for chart understanding. We also study VQDv1, a dataset that crucially requires identifying all image regions that satisfy a given query. Our experiments reveal weaknesses of many MLLMs that have not previously been reported.
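To make the "fine-grained analysis" concrete, the sketch below shows one simple way to score predictions per question type and report a mean-per-type accuracy, the kind of breakdown that TDIUC's 12 question types enable. This is a minimal illustration, not the paper's evaluation pipeline: the record fields `question_type`, `prediction`, and `answer` are hypothetical, and exact string matching is a simplification of real VQA answer scoring.

```python
from collections import defaultdict

def per_type_accuracy(records):
    """Compute accuracy per question type and the arithmetic mean-per-type accuracy.

    `records` is assumed to be an iterable of dicts with hypothetical keys
    'question_type', 'prediction', and 'answer'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        qtype = r["question_type"]
        total[qtype] += 1
        # Simplified exact-match scoring (real VQA metrics normalize answers further).
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[qtype] += 1

    per_type = {t: correct[t] / total[t] for t in total}
    mean_per_type = sum(per_type.values()) / len(per_type) if per_type else 0.0
    return per_type, mean_per_type


# Toy usage (illustrative records only, not real TDIUC data):
records = [
    {"question_type": "counting", "prediction": "3", "answer": "3"},
    {"question_type": "counting", "prediction": "2", "answer": "4"},
    {"question_type": "color", "prediction": "red", "answer": "red"},
]
per_type, mpt = per_type_accuracy(records)
print(per_type)       # {'counting': 0.5, 'color': 1.0}
print(round(mpt, 2))  # 0.75
```

Reporting a mean over question types rather than a single overall accuracy prevents frequent, easy question types from masking poor performance on rare or harder ones, which is the motivation for evaluating on fine-grained datasets in the first place.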
Submission Number: 46