RoadSocial: A Diverse VideoQA Dataset and Benchmark for Road Event Understanding from Social Video Narratives

Published: 06 May 2025 · Last Modified: 06 May 2025 · SynData4CV · CC BY 4.0
Keywords: Road Event Understanding, Video Question Answering, VideoQA, Social media videos, Social commentary, Synthetic QA Generation, Diverse dataset, Dataset, Benchmark, Video Language Models, Video LLMs
TL;DR: We generate a large-scale, diverse VideoQA dataset from social media narratives and benchmark SOTA Video LLMs on their road event understanding abilities.
Abstract: We introduce **RoadSocial, a large-scale, diverse VideoQA dataset tailored for generic road event understanding from social media narratives**. Unlike existing datasets limited by regional bias, viewpoint bias and expert-driven annotations, RoadSocial captures the global complexity of road events with varied geographies, camera viewpoints (CCTV, handheld, drones) and rich social discourse. Our **scalable semi-automatic annotation framework** leverages Text LLMs and Video LLMs to generate comprehensive question-answer pairs across 12 challenging QA tasks, pushing the boundaries of road event understanding. RoadSocial is derived from social media videos spanning **14M frames** and **414K social comments**, resulting in a dataset with 13.2K videos, 674 tags and **260K high-quality QA pairs**. We **evaluate 18 Video LLMs (open-source and proprietary, driving-specific and general-purpose)** on the RoadSocial-QA benchmark. Through fine-tuning on RoadSocial, we also demonstrate our dataset’s utility in improving road event understanding capabilities of general-purpose Video LLMs.
Submission Number: 27
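
The abstract describes a semi-automatic framework in which Text LLMs turn social commentary into QA pairs. The sketch below is a minimal illustration of that idea under stated assumptions, not the paper's actual pipeline: `query_text_llm`, `QA_PROMPT`, and the JSON schema are hypothetical, and the Video-LLM verification stage the abstract implies is only noted in a comment.

```python
import json
from typing import Callable

# Hypothetical prompt; the paper's real prompts and 12-task taxonomy
# are not reproduced here.
QA_PROMPT = """You are given social-media commentary on a road-event video.

Comments:
{comments}

Write question-answer pairs probing understanding of the road event.
Return a JSON list: [{{"task": "...", "question": "...", "answer": "..."}}]
"""

def generate_qa_pairs(comments: list[str],
                      query_text_llm: Callable[[str], str]) -> list[dict]:
    """Distill social commentary into candidate QA pairs via a Text LLM.

    In the paper's framework, candidates like these would additionally be
    checked against the video (e.g., with a Video LLM) before entering the
    dataset; that verification step is omitted from this sketch.
    """
    prompt = QA_PROMPT.format(
        comments="\n".join(f"- {c}" for c in comments))
    raw = query_text_llm(prompt)   # any chat-completion backend
    return json.loads(raw)         # candidate QA pairs, pre-filtering
```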