Research Area: Evaluation, LMs on diverse modalities and novel applications
Keywords: multi-modality, long-context model benchmark, vision-language model benchmark
TL;DR: We introduce MileBench, a new benchmark that reveals that most Multimodal Large Language Models struggle to handle long contexts and multiple images.
Abstract: Despite the rapid progress of Multimodal Large Language Models (MLLMs) and their impressive performance on various benchmarks, it remains uncertain how well these results transfer to real-world tasks.
This uncertainty stems primarily from the benchmarks' limited coverage of long-context and multi-image tasks, which are critical in real-world applications.
Existing benchmarks often focus on single-image, short-text samples; when they do assess multi-image tasks, they either limit the number of images or focus on time-series captioning, potentially masking performance issues such as hallucination in long-context settings.
To address these limitations, we introduce \textbf{MileBench}, a pioneering benchmark designed to rigorously test the \textbf{M}ult\textbf{I}modal \textbf{L}ong-cont\textbf{E}xt capabilities of MLLMs.
The benchmark comprises interleaved text and images, long contexts, and a diverse set of tasks requiring both comprehension and generation.
We establish two distinct evaluation sets, diagnostic and realistic, to systematically assess MLLMs' long-context adaptation capacity and their ability to complete tasks in long-context scenarios.
Our experiments on 19 models reveal that while the closed-source GPT-4(Vision) outperforms the others, most open-source MLLMs perform poorly in long-context situations.
We therefore strongly encourage intensified research on enhancing MLLMs' long-context capabilities, especially in scenarios involving multiple images.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 782