RoadSocial: A Diverse VideoQA Dataset and Benchmark for Road Event Understanding from Social Video Narratives
Keywords: General Road Event Understanding, Video Question Answering, Social media videos, Social commentary, Geographically Diverse dataset, VideoQA, Dataset, Benchmark, Video Language Models, Video LLMs
TL;DR: We curate a large-scale, diverse VideoQA dataset from social media video narratives and benchmark state-of-the-art Video LLMs on their road event understanding abilities.
Abstract: We introduce **RoadSocial, a large-scale, diverse VideoQA dataset tailored for generic road event understanding from social media narratives**. Unlike existing datasets limited by regional bias, viewpoint bias and expert-driven annotations, RoadSocial captures the global complexity of road events with varied geographies, camera viewpoints (CCTV, handheld, drones) and rich social discourse.
Our **scalable semi-automatic annotation framework** leverages Text LLMs and Video LLMs to generate comprehensive question-answer pairs across 12 challenging QA tasks. RoadSocial is derived from social media videos spanning **14M frames** and **414K social comments** from **100 countries**, yielding a dataset of **13.2K videos, 674 tags and 260K high-quality QA pairs**.
We **evaluate 18 Video LLMs (open-source and proprietary, driving-specific and general-purpose)** on the RoadSocial-QA benchmark. We also demonstrate RoadSocial’s utility in improving the road event understanding capabilities of general-purpose Video LLMs.
Submission Number: 3