RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video
Keywords: multimodal large language model, real-time video analysis, video reasoning benchmark
Abstract: Multimodal Large Language Models (MLLMs) increasingly excel at perception, understanding, and reasoning. However, current benchmarks inadequately evaluate their ability to perform these tasks continuously in dynamic, real-world environments. To bridge this gap, we introduce RTV-Bench, a fine-grained benchmark for MLLM real-time video analysis. RTV-Bench is built on three key principles: (1) Multi-Timestamp Question Answering (MTQA), where answers evolve with scene changes; (2) Hierarchical Question Structure, combining basic and advanced queries; and (3) Multi-dimensional Evaluation, assessing continuous perception, understanding, and reasoning abilities. RTV-Bench contains 552 diverse videos (167.2 hours) and 4,631 high-quality QA pairs. We evaluated leading MLLMs, including proprietary models (GPT-4o, Gemini 2.0), open-source offline models (Qwen2.5-VL, VideoLLaMA3), and open-source real-time models (VITA-1.5, InternLM-XComposer2.5-OmniLive). Experimental results show that open-source real-time models largely outperform offline ones but still trail top proprietary models. Our analysis also reveals that larger model sizes and higher frame sampling rates do not significantly boost RTV-Bench performance and sometimes cause slight decreases. This underscores the need for model architectures better optimized for video stream processing and long sequences to advance real-time video analysis with MLLMs.
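To make the MTQA idea concrete, the sketch below shows one hypothetical way a multi-timestamp QA record could be represented: a single question paired with ground-truth answers keyed to timestamps, so the correct answer changes as the scene evolves. All field and class names here are illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical sketch of a Multi-Timestamp QA (MTQA) record; field names
# are illustrative assumptions, not RTV-Bench's documented schema.
from dataclasses import dataclass

@dataclass
class TimestampedAnswer:
    timestamp_s: float  # point in the video stream (seconds) where this answer holds
    answer: str         # correct answer given the scene state at that timestamp

@dataclass
class MTQARecord:
    video_id: str
    question: str                     # one question asked repeatedly over the stream
    level: str                        # hierarchical tier, e.g. "basic" or "advanced"
    dimension: str                    # evaluated ability: perception, understanding, or reasoning
    answers: list[TimestampedAnswer]  # ground truth evolves with scene changes

# Example: the same question has different correct answers at different times.
record = MTQARecord(
    video_id="demo_video",
    question="How many people are visible in the scene?",
    level="basic",
    dimension="perception",
    answers=[
        TimestampedAnswer(timestamp_s=10.0, answer="two"),
        TimestampedAnswer(timestamp_s=95.0, answer="four"),
    ],
)
```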
Croissant File: json
Dataset URL: https://huggingface.co/datasets/RTVBench/RTV-Bench
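Since the dataset is hosted on Hugging Face, a minimal loading sketch is shown below, assuming it can be read with the `datasets` library; the split name and the record layout printed here are assumptions, not documented parts of RTV-Bench.

```python
# Minimal loading sketch; the "test" split is an assumption about how the
# RTV-Bench repository is organized.
from datasets import load_dataset

ds = load_dataset("RTVBench/RTV-Bench", split="test")
print(ds)      # inspect available columns
print(ds[0])   # inspect one QA record
```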
Supplementary Material: zip
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Flagged For Ethics Review: true
Submission Number: 1114