The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thoroughly. MMLongBench comprises 13,331 examples spanning five categories of downstream tasks with broad coverage of image types. To assess models across input lengths, every example is delivered at five standardized input lengths via a cross-modal tokenization scheme that combines vision patches and text tokens. By thoroughly benchmarking 46 LCVLMs, we find that all models struggle on our multimodal long-context tasks. We further provide a comprehensive analysis of current models' long-context ability. With wide task coverage, varied image types, and rigorous length control, MMLongBench provides the missing foundation for developing the next generation of LCVLMs.
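
To illustrate the idea behind such a cross-modal length-control scheme, the sketch below counts an interleaved example's total input length as text tokens plus a fixed number of vision patches per image, then asks how many images fit under a standardized budget. This is a minimal illustration, not the benchmark's implementation: the patch count per image, the five budget values, and all function names are assumptions for exposition.

```python
# Minimal sketch of cross-modal token accounting (illustrative assumptions,
# not MMLongBench's actual constants or code).

PATCHES_PER_IMAGE = 576  # hypothetical: e.g., a 24x24 vision-patch grid; real models vary
TARGET_LENGTHS = [8_192, 16_384, 32_768, 65_536, 131_072]  # hypothetical five budgets


def combined_length(num_text_tokens: int, num_images: int) -> int:
    """Total input length: text tokens plus vision patches across all images."""
    return num_text_tokens + num_images * PATCHES_PER_IMAGE


def max_images_for_budget(num_text_tokens: int, budget: int) -> int:
    """How many images fit alongside the given text under a standardized budget."""
    return max(0, (budget - num_text_tokens) // PATCHES_PER_IMAGE)


if __name__ == "__main__":
    # For an example with 2,000 text tokens, show the image capacity per budget.
    for budget in TARGET_LENGTHS:
        n = max_images_for_budget(num_text_tokens=2_000, budget=budget)
        print(f"budget={budget:>7}  max_images={n:>3}  "
              f"total={combined_length(2_000, n)}")
```

Under this accounting, one length dial governs both modalities: adding an image costs the same budget as a fixed run of text tokens, which is what allows examples to be delivered at several standardized lengths regardless of their text-to-image ratio.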