IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

Published: 26 Jan 2026, Last Modified: 27 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Image-Grounded Video Perception and Reasoning, Multimodal LLMs, Benchmark
TL;DR: This paper introduces IV-Bench, the first comprehensive benchmark for evaluating Multimodal Large Language Models on image-grounded video perception and reasoning.
Abstract: Current benchmarks for Multimodal Large Language Models (MLLMs) predominantly rely on text-only queries, overlooking the essential role of images as visual context for enhancing video comprehension and enabling natural human-AI interaction. To bridge this gap, we introduce \textbf{IV-Bench}, the first comprehensive benchmark for evaluating MLLMs on Image-Grounded Video Perception and Reasoning. IV-Bench comprises 966 videos paired with 2,560 meticulously annotated image-text queries across 13 tasks (7 perception and 6 reasoning) spanning 5 distinct categories. We extensively evaluate state-of-the-art MLLMs, including open-source models (e.g., InternVL2.5, Qwen2.5-VL) and closed-source models (e.g., GPT-4o and the Gemini 2.0 series), revealing substantial performance gaps: the best-performing model achieves only 28.9\% accuracy. Ablation studies demonstrate that incorporating images significantly enhances video understanding and highlight key model design factors that influence performance. Our findings provide valuable insights and guidance for future research. The code and dataset are available at \url{https://github.com/multimodal-art-projection/IV-Bench}.
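To make the evaluation protocol in the abstract concrete, below is a minimal sketch of an accuracy-scoring loop over image-grounded video queries. The annotation schema (field names such as `video`, `image`, `question`, `options`, `answer`), the file name `iv_bench_annotations.json`, and the `query_model` stub are all illustrative assumptions rather than the benchmark's confirmed format; consult the repository linked above for the actual loaders and schema.

```python
import json
from pathlib import Path


def load_annotations(path: str) -> list[dict]:
    """Load IV-Bench-style annotations from a JSON file.

    The field names used below (video, image, question, options,
    answer) are illustrative assumptions, not the confirmed schema.
    """
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def query_model(video_path: Path, image_path: Path,
                question: str, options: list[str]) -> str:
    """Hypothetical MLLM call: given the video, the grounding image,
    and the text query, return one option label (e.g. 'A')."""
    raise NotImplementedError("plug in your MLLM inference here")


def evaluate(annotations: list[dict], data_root: str) -> float:
    """Score accuracy over image-grounded multiple-choice queries."""
    root = Path(data_root)
    correct = 0
    for item in annotations:
        pred = query_model(
            root / item["video"],   # source video
            root / item["image"],   # grounding image for this query
            item["question"],
            item["options"],
        )
        correct += pred == item["answer"]
    return correct / len(annotations)


if __name__ == "__main__":
    anns = load_annotations("iv_bench_annotations.json")
    print(f"accuracy: {evaluate(anns, 'data/'):.1%}")
```

The loop mirrors the headline metric reported in the abstract: a single accuracy figure aggregated over all image-text queries, with each query grounded jointly in a video and a companion image.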
Primary Area: datasets and benchmarks
Submission Number: 10067