V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction
Keywords: LVLMs, Video Understanding, Visual Prompting
TL;DR: A Video-Language Understanding Benchmark for Evaluating LVLMs on Human-Model Interaction through Visual Prompts
Abstract: Large Vision-Language Models (LVLMs) have made significant strides in video understanding in recent years. Nevertheless, existing video benchmarks rely predominantly on text prompts for evaluation, which often require complex referential language to specify the intended target. To address this limitation, we propose V2P-Bench, a robust and comprehensive benchmark for evaluating the ability of LVLMs to understand video visual prompts in human–model interaction scenarios. V2P-Bench consists of 980 videos and 1,172 well-structured, high-quality QA pairs, each paired with manually annotated visual prompt frames. The benchmark spans three main tasks and twelve categories, enabling fine-grained, instance-level evaluation. Through an in-depth analysis of current LVLMs, we identify several key findings: 1) In interactive scenarios, visual prompts are both more model-friendly and more user-friendly than text prompts, yielding significantly better model performance and an improved user experience. 2) Models show reasonable zero-shot understanding of visual prompts but struggle with spatiotemporal reasoning: even o1 achieves only 71.8%, far below the human expert score of 88.3%, and most open-source models score below 60%. 3) LVLMs exhibit pervasive hacking behaviors in video question answering, which intensify with longer videos and sparser frame sampling, artificially inflating performance scores. We anticipate that V2P-Bench will not only shed light on these challenges but also serve as a foundational tool for advancing human–model interaction. The code and datasets are available at https://github.com/gaotiexinqu/v2p-bench.
Primary Area: datasets and benchmarks
Submission Number: 8969