UAL-Bench: The First Comprehensive Unusual Activity Localization Benchmark

Published: 01 Jan 2025 · Last Modified: 30 Apr 2025 · WACV 2025 · CC BY-SA 4.0
Abstract: Localizing unusual activities in videos, such as abnormal behaviors or traffic incidents, holds practical significance. However, pretrained foundation models struggle to localize diverse unusual events, likely because such events are insufficiently represented in the models' pretraining datasets. To explore foundation models' capability in localizing unusual activities, we introduce UAL-Bench, a comprehensive benchmark for unusual activity localization featuring three video datasets (UAG-OOPS, UAG-SSBD, and UAG-FunQA) and an instruction-tuning dataset (OOPS-UAG-Instruct) to improve model capabilities. We also introduce a new auxiliary metric, R@1, TD ≤ p, which counts a detection as a true positive if both its start and end timestamps fall within a threshold of the ground truth. On UAL-Bench, we evaluate three approaches: Video-Language Models (Vid-LLMs), instruction-tuned Vid-LLMs, and a novel integration of Vision-Language Models and Large Language Models (VLM-LLM). Our results show that the VLM-LLM approach excels at localizing short-span unusual events and predicts their onset (start time) more accurately than Vid-LLMs. Our findings highlight the challenges posed by long-duration videos, particularly in autism diagnosis scenarios, and the need for further advances in localization techniques. Our work not only provides a benchmark for unusual activity localization but also outlines the key challenges facing existing foundation models, suggesting future research directions for this important task.
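
To make the R@1, TD ≤ p criterion concrete, the sketch below shows one plausible reading of it: the top-1 predicted segment counts as a true positive when both its start and end timestamps lie within p seconds of the ground-truth boundaries, and R@1 is the fraction of videos for which this holds. The function names, the per-video (start, end) tuple format, and the aggregation are illustrative assumptions, not the authors' implementation.

```python
def is_true_positive(pred_start, pred_end, gt_start, gt_end, p):
    """Assumed reading of R@1, TD <= p: a detection is a true positive when
    both its start and end timestamps lie within p seconds of the
    ground-truth boundaries."""
    return abs(pred_start - gt_start) <= p and abs(pred_end - gt_end) <= p


def recall_at_1_td(predictions, ground_truths, p):
    """Fraction of videos whose top-1 predicted segment is a true positive.

    predictions / ground_truths: lists of (start, end) tuples in seconds,
    aligned by video index. Names and structure are illustrative only.
    """
    if not ground_truths:
        return 0.0
    hits = sum(
        is_true_positive(ps, pe, gs, ge, p)
        for (ps, pe), (gs, ge) in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)


if __name__ == "__main__":
    preds = [(2.0, 5.5), (10.0, 14.0)]
    gts = [(1.5, 5.0), (12.5, 18.0)]
    # First video is within 1 s on both boundaries, second is not -> 0.5
    print(recall_at_1_td(preds, gts, p=1.0))
```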