RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video
Keywords: multimodal large language model, real-time video analysis, video reasoning benchmark
Abstract: Multimodal Large Language Models (MLLMs) increasingly excel at perception, understanding, and reasoning. However, current benchmarks inadequately evaluate their ability to perform these tasks continuously in dynamic, real-world environments. To bridge this gap, we introduce RTV-Bench, a fine-grained benchmark for MLLM real-time video analysis. RTV-Bench is built on three key principles: (1) Multi-Timestamp Question Answering (MTQA), where answers evolve with scene changes; (2) Hierarchical Question Structure, combining basic and advanced queries; and (3) Multi-dimensional Evaluation, assessing continuous perception, understanding, and reasoning abilities. RTV-Bench contains 552 diverse videos (167.2 hours) and 4,631 high-quality QA pairs. We evaluated leading MLLMs, including proprietary models (GPT-4o, Gemini 2.0), open-source offline models (Qwen2.5-VL, VideoLLaMA3), and open-source real-time models (VITA-1.5, InternLM-XComposer2.5-OmniLive). Experimental results show that open-source real-time models largely outperform offline ones but still trail top proprietary models. Our analysis also reveals that larger model sizes and higher frame sampling rates do not significantly boost RTV-Bench performance and sometimes cause slight decreases. This underscores the need for model architectures better optimized for video stream processing and long sequences to advance real-time video analysis with MLLMs.
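To make the MTQA idea concrete, the sketch below shows one hypothetical way a multi-timestamp QA record could be represented: a single question paired with ground-truth answers keyed to timestamps, so the correct answer changes as the scene evolves. All field and class names here are illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical sketch of a Multi-Timestamp QA (MTQA) record; field names
# are illustrative assumptions, not RTV-Bench's documented schema.
from dataclasses import dataclass

@dataclass
class TimestampedAnswer:
    timestamp_s: float  # point in the video stream (seconds) where this answer holds
    answer: str         # correct answer given the scene state at that timestamp

@dataclass
class MTQARecord:
    video_id: str
    question: str                     # one question asked repeatedly over the stream
    level: str                        # hierarchical tier, e.g. "basic" or "advanced"
    dimension: str                    # evaluated ability: perception, understanding, or reasoning
    answers: list[TimestampedAnswer]  # ground truth evolves with scene changes

# Example: the same question has different correct answers at different times.
record = MTQARecord(
    video_id="demo_video",
    question="How many people are visible in the scene?",
    level="basic",
    dimension="perception",
    answers=[
        TimestampedAnswer(timestamp_s=10.0, answer="two"),
        TimestampedAnswer(timestamp_s=95.0, answer="four"),
    ],
)
```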
Croissant File: json
Dataset URL: https://huggingface.co/datasets/RTVBench/RTV-Bench
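Since the dataset is hosted on Hugging Face, a minimal loading sketch is shown below, assuming it can be read with the `datasets` library; the split name and the record layout printed here are assumptions, not documented parts of RTV-Bench.

```python
# Minimal loading sketch; the "test" split is an assumption about how the
# RTV-Bench repository is organized.
from datasets import load_dataset

ds = load_dataset("RTVBench/RTV-Bench", split="test")
print(ds)      # inspect available columns
print(ds[0])   # inspect one QA record
```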
Supplementary Material: zip
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Flagged For Ethics Review: true
Submission Number: 1114