TUMTraf VideoQA: Dataset and Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes
TL;DR: We introduce a large-scale video-language dataset and benchmark for the unified understanding of complex roadside traffic scenarios.
Abstract: We present TUMTraf VideoQA, a novel dataset and benchmark designed for spatio-temporal video understanding in complex roadside traffic scenarios. The dataset comprises 1,000 videos with 85,000 multiple-choice QA pairs, 2,300 object captioning annotations, and 5,700 object grounding annotations, covering diverse real-world conditions such as adverse weather and traffic anomalies. By incorporating tuple-based spatio-temporal object expressions, TUMTraf VideoQA unifies three essential tasks (multiple-choice video question answering, referred object captioning, and spatio-temporal object grounding) within a cohesive evaluation framework. We further introduce the TraffiX-Qwen baseline model, enhanced with visual token sampling strategies, which provides valuable insights into the challenges of fine-grained spatio-temporal reasoning. Extensive experiments demonstrate the dataset’s complexity, highlight the limitations of existing models, and position TUMTraf VideoQA as a robust foundation for advancing research in intelligent transportation systems. The dataset and benchmark are publicly available to facilitate further exploration.
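To make the idea of a tuple-based spatio-temporal object expression concrete, a minimal sketch is given below. The ObjectTrace class, its field names, and the coordinate convention are illustrative assumptions for exposition, not the dataset's actual schema; consult the project page for the released annotation format.

from dataclasses import dataclass

@dataclass
class ObjectTrace:
    """Hypothetical referred object: a sequence of (frame, bounding box) tuples."""
    object_id: str
    category: str
    # Each tuple: (frame_index, (x_min, y_min, x_max, y_max)) in pixel coordinates.
    trajectory: list[tuple[int, tuple[float, float, float, float]]]

# One such expression can anchor all three unified tasks:
# - grounding: predict the trajectory given a textual reference,
# - referred captioning: describe the object given its trajectory,
# - VideoQA: answer questions whose evidence is the referred object.
car = ObjectTrace(
    object_id="veh_042",  # assumed identifier, for illustration only
    category="car",
    trajectory=[
        (10, (412.0, 230.0, 468.0, 275.0)),
        (20, (430.0, 234.0, 489.0, 281.0)),
    ],
)
print(f"{car.category} {car.object_id} spans frames "
      f"{car.trajectory[0][0]} to {car.trajectory[-1][0]}")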
Lay Summary: Understanding complex traffic scenes is essential for developing intelligent transportation systems. Yet, most existing AI benchmarks focus on either simple driving environments or isolated tasks, limiting progress in real-world applications.
To address this gap, we introduce TUMTraf VideoQA, a new dataset featuring 1,000 real-world roadside videos and over 85,000 multiple-choice questions. It includes detailed annotations for describing and locating objects over time, and it uniquely combines three tasks within one benchmark: video question answering, referred object captioning, and spatio-temporal grounding. We also present TraffiX-Qwen, a strong baseline that performs well on these tasks and reveals key limitations of current models, particularly in fine-grained spatio-temporal reasoning.
TUMTraf VideoQA provides a challenging, unified benchmark to drive the development of next-generation models for traffic understanding and intelligent transportation systems.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://traffix-videoqa.github.io/
Primary Area: Applications->Computer Vision
Keywords: Traffic Scene Understanding, Video Understanding, Vision Language Model, Spatio-Temporal Reasoning, Autonomous Driving
Submission Number: 5092