IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

Published: 26 Jan 2026, Last Modified: 27 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Image-Grounded Video Perception and Reasoning, Multimodal LLMs, Benchmark
TL;DR: This paper introduces IV-Bench, the first comprehensive benchmark for evaluating Multimodal Large Language Models on image-grounded video perception and reasoning.
Abstract: Current benchmarks for Multimodal Large Language Models (MLLMs) predominantly rely on text-only queries, overlooking the essential role of images as visual context for enhancing video comprehension and enabling natural human-AI interaction. To bridge this gap, we introduce \textbf{IV-Bench}, the first comprehensive benchmark for evaluating MLLMs on Image-Grounded Video Perception and Reasoning. IV-Bench comprises 966 videos paired with 2,560 meticulously annotated image-text queries across 13 tasks (7 perception and 6 reasoning) spanning 5 distinct categories. We extensively evaluate state-of-the-art MLLMs, including open-source models (e.g., InternVL2.5, Qwen2.5-VL) and closed-source models (e.g., GPT-4o and the Gemini 2.0 series), revealing substantial performance gaps: the best-performing model achieves only 28.9\% accuracy. Ablation studies demonstrate that incorporating images significantly enhances video understanding and highlight key model design factors that influence performance. Our findings provide valuable insights and guidance for future research. The code and dataset are available at \url{https://github.com/multimodal-art-projection/IV-Bench}.
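To make the evaluation protocol in the abstract concrete, below is a minimal sketch of an accuracy-scoring loop over image-grounded video queries. The annotation schema (field names such as `video`, `image`, `question`, `options`, `answer`), the file name `iv_bench_annotations.json`, and the `query_model` stub are all illustrative assumptions rather than the benchmark's confirmed format; consult the repository linked above for the actual loaders and schema.

```python
import json
from pathlib import Path


def load_annotations(path: str) -> list[dict]:
    """Load IV-Bench-style annotations from a JSON file.

    The field names used below (video, image, question, options,
    answer) are illustrative assumptions, not the confirmed schema.
    """
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def query_model(video_path: Path, image_path: Path,
                question: str, options: list[str]) -> str:
    """Hypothetical MLLM call: given the video, the grounding image,
    and the text query, return one option label (e.g. 'A')."""
    raise NotImplementedError("plug in your MLLM inference here")


def evaluate(annotations: list[dict], data_root: str) -> float:
    """Score accuracy over image-grounded multiple-choice queries."""
    root = Path(data_root)
    correct = 0
    for item in annotations:
        pred = query_model(
            root / item["video"],   # source video
            root / item["image"],   # grounding image for this query
            item["question"],
            item["options"],
        )
        correct += pred == item["answer"]
    return correct / len(annotations)


if __name__ == "__main__":
    anns = load_annotations("iv_bench_annotations.json")
    print(f"accuracy: {evaluate(anns, 'data/'):.1%}")
```

The loop mirrors the headline metric reported in the abstract: a single accuracy figure aggregated over all image-text queries, with each query grounded jointly in a video and a companion image.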
Primary Area: datasets and benchmarks
Submission Number: 10067