The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thoroughly. MMLongBench comprises 13,331 examples spanning five categories of downstream tasks with broad coverage of image types. To assess models across input lengths, every example is delivered at five standardized input lengths via a cross-modal tokenization scheme that combines vision patches and text tokens. By thoroughly benchmarking 46 LCVLMs, we find that all models struggle on our multimodal long-context tasks. We further provide a comprehensive analysis of current models' long-context ability. With wide task coverage, varied image types, and rigorous length control, MMLongBench provides the missing foundation for developing the next generation of LCVLMs.
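
To illustrate the idea behind such a cross-modal length-control scheme, the sketch below counts an interleaved example's total input length as text tokens plus a fixed number of vision patches per image, then asks how many images fit under a standardized budget. This is a minimal illustration, not the benchmark's implementation: the patch count per image, the five budget values, and all function names are assumptions for exposition.

```python
# Minimal sketch of cross-modal token accounting (illustrative assumptions,
# not MMLongBench's actual constants or code).

PATCHES_PER_IMAGE = 576  # hypothetical: e.g., a 24x24 vision-patch grid; real models vary
TARGET_LENGTHS = [8_192, 16_384, 32_768, 65_536, 131_072]  # hypothetical five budgets


def combined_length(num_text_tokens: int, num_images: int) -> int:
    """Total input length: text tokens plus vision patches across all images."""
    return num_text_tokens + num_images * PATCHES_PER_IMAGE


def max_images_for_budget(num_text_tokens: int, budget: int) -> int:
    """How many images fit alongside the given text under a standardized budget."""
    return max(0, (budget - num_text_tokens) // PATCHES_PER_IMAGE)


if __name__ == "__main__":
    # For an example with 2,000 text tokens, show the image capacity per budget.
    for budget in TARGET_LENGTHS:
        n = max_images_for_budget(num_text_tokens=2_000, budget=budget)
        print(f"budget={budget:>7}  max_images={n:>3}  "
              f"total={combined_length(2_000, n)}")
```

Under this accounting, one length dial governs both modalities: adding an image costs the same budget as a fixed run of text tokens, which is what allows examples to be delivered at several standardized lengths regardless of their text-to-image ratio.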