ReForm-Eval: Evaluating Large Vision Language Models via Unified Re-Formulation of Task-Oriented Benchmarks
Abstract: Recent years have witnessed remarkable progress in the development of large vision-language models (LVLMs). Benefiting from strong language backbones and efficient cross-modal alignment strategies, LVLMs exhibit surprising capabilities to perceive visual signals and perform visually grounded reasoning. However, the capabilities of LVLMs have not been comprehensively and quantitatively evaluated. Most existing multi-modal benchmarks require task-oriented input-output formats, posing great challenges to automatically assessing the free-form text output of LVLMs. To effectively leverage the available annotations and reduce the manual effort required for constructing new benchmarks, we propose to re-formulate existing benchmarks into unified LVLM-compatible formats. Through systematic data collection and reformulation, we present the ReForm-Eval benchmark, offering substantial data for evaluating various capabilities of LVLMs. Through extensive experiments and analysis, we demonstrate the comprehensiveness and reliability of ReForm-Eval in assessing various LVLMs. Our benchmark and evaluation framework will be open-sourced as a cornerstone for advancing the development of LVLMs.
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Generation] Multimedia Foundation Models
Relevance To Conference: Recently, LLM-empowered large vision-language models (LVLMs) have progressively come to play a pivotal role in both creating and analyzing content within the multimedia landscape. For instance, LVLMs are not only utilized to craft engaging posts for platforms like Twitter and Weibo, but also applied to analyze advertisements, videos, and other types of multimedia content. A reliable quantitative assessment of LVLMs is therefore necessary to help us select the most suitable models for real-world multimedia scenarios.
In this paper, we propose an automated framework to comprehensively evaluate LVLMs. We overcome the discrepancy between existing benchmarks and LVLMs by re-formulating data into LVLM-compatible formats. With this framework, we construct a benchmark, namely ReForm-Eval, across various multimedia scenarios and tasks. ReForm-Eval can serve as a reliable platform to quantitatively assess the capabilities of LVLMs and facilitate the development of multimedia foundation models.
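To make the idea of "re-formulating data into LVLM-compatible formats" concrete, below is a minimal, hypothetical sketch of how an open-ended QA annotation might be wrapped into a unified multiple-choice prompt that a free-form LVLM can answer with a single option letter. The field names ("image", "question", "answer"), the helper `to_multiple_choice`, and the prompt template are illustrative assumptions, not the actual ReForm-Eval pipeline or schema.

```python
# Hypothetical illustration: re-formulate a VQA-style annotation into a
# unified multiple-choice item for a free-form LVLM. Not the actual
# ReForm-Eval implementation; field names and prompt wording are assumed.
from string import ascii_uppercase


def to_multiple_choice(sample: dict, distractors: list[str]) -> dict:
    """Wrap an open-ended QA annotation as a closed-form multiple-choice item."""
    # Gold answer placed first here for clarity; a real pipeline would
    # shuffle option order to avoid position bias.
    options = [sample["answer"], *distractors]
    letters = ascii_uppercase[: len(options)]
    choices = "\n".join(f"({l}) {o}" for l, o in zip(letters, options))
    prompt = (
        f"Question: {sample['question']}\n"
        f"Options:\n{choices}\n"
        "Answer with the option letter only."
    )
    return {"image": sample["image"], "prompt": prompt, "target": letters[0]}


# Toy usage with a made-up annotation.
item = {
    "image": "coco_000001.jpg",
    "question": "What is the man holding?",
    "answer": "a surfboard",
}
print(to_multiple_choice(item, ["a kite", "a dog", "an umbrella"])["prompt"])
```

Constraining the output space this way is one plausible route to automatic scoring of free-form LVLM outputs: the model's response can be matched against the target option letter instead of requiring task-specific answer parsing.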
Supplementary Material: zip
Submission Number: 4512