LVLM-EHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

Published: 01 Jan 2025 · Last Modified: 02 Mar 2025 · IEEE Trans. Pattern Anal. Mach. Intell. 2025 · CC BY-SA 4.0
Abstract: Large Vision-Language Models (LVLMs) have recently come to play a dominant role in multimodal vision-language learning. Despite this success, a holistic evaluation of their efficacy is still lacking. This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation Hub (LVLM-eHub). Our LVLM-eHub consists of 13 representative LVLMs, such as InstructBLIP and LLaVA, which are thoroughly evaluated through a quantitative capability evaluation and an online arena platform. The former assesses five categories of multimodal capability, such as visual question answering and object hallucination, on 42 in-domain text-related visual benchmarks, while the latter provides user-level evaluation of LVLMs in an open-world question-answering scenario. The study investigates how specific features of LVLMs, such as model configurations, modality alignment mechanisms, and training data, affect multimodal understanding. By comprehensively comparing these features across the quantitative and arena evaluations, our study uncovers several findings that establish a foundation for developing and evaluating new strategies for enhancing multimodal techniques.
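
The arena platform ranks models from pairwise, user-judged comparisons. Below is a minimal sketch of an Elo-style rating update of the kind commonly used for such arena rankings; the model names, battle log, K-factor, and base rating here are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict


def expected_score(r_a, r_b):
    """Expected win probability of a model rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update_elo(ratings, model_a, model_b, outcome, k=32):
    """Update both ratings in place; outcome = 1 if model_a wins, 0 if it loses, 0.5 for a tie."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += k * (outcome - e_a)
    ratings[model_b] += k * ((1 - outcome) - (1 - e_a))


# Hypothetical battle log: (model_a, model_b, outcome for model_a)
battles = [("InstructBLIP", "LLaVA", 1), ("LLaVA", "MiniGPT-4", 0.5)]

ratings = defaultdict(lambda: 1000.0)  # every model starts from the same base rating
for a, b, outcome in battles:
    update_elo(ratings, a, b, outcome)

print(dict(ratings))
```

Aggregating many such user votes yields a leaderboard that complements the benchmark-based capability scores.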