Keywords: Video Anomaly Detection, Vision-Language Models, Uncertainty-Aware Mechanism
TL;DR: This paper proposes Una, a novel uncertainty-aware framework for applying vision-language models to video anomaly detection.
Abstract: Vision-language models (VLMs) have demonstrated impressive reasoning capability in visual understanding tasks. One recent highlight of VLMs is their success in generating human-understandable explanations in video anomaly detection (VAD), an advanced video understanding task that requires nuanced judgment of context-dependent and ambiguous video content. Representative works mainly formulate this problem as a natural language generation task conditioned on task-related prompts and visual inputs. However, under this paradigm, the input is processed segment by segment, and the VLM generates a response for each segment independently; this limited context inevitably introduces uncertainty into its reasoning. To bridge this fundamental gap, we propose Una, an uncertainty-aware VLM framework for VAD that objectively identifies reasoning-level uncertainty in VLMs and mitigates it accordingly: Una first retrieves relevant scenes based on temporal and semantic relevance and determines whether uncertainty exists by checking prediction consistency across those scenes; it then introduces collective intelligence, through the cooperation of multiple VLMs, to resolve the uncertainty. With Una, VLMs achieve remarkable performance and advanced explainability, for the first time surpassing task-specific methods on challenging benchmarks in the most difficult setting, where instruction tuning is not allowed.
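The abstract describes a three-step pipeline: relevance-based scene retrieval, consistency-based uncertainty detection, and multi-VLM resolution. The Python sketch below illustrates one plausible reading of that pipeline; every name (`cosine`, `retrieve_relevant_scenes`, `predict`, the blend weight `alpha`, the voting scheme) is an illustrative assumption, not the paper's actual implementation.

```python
import numpy as np

# A minimal sketch of the Una pipeline as outlined in the abstract.
# All helpers and parameters here are hypothetical placeholders.

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_relevant_scenes(embeddings, query_idx, k=4, alpha=0.5):
    """Step 1: rank other segments by blended temporal + semantic relevance."""
    scores = []
    for i, emb in enumerate(embeddings):
        if i == query_idx:
            continue
        semantic = cosine(embeddings[query_idx], emb)
        temporal = 1.0 / (1.0 + abs(i - query_idx))  # closer in time scores higher
        scores.append((alpha * semantic + (1.0 - alpha) * temporal, i))
    return [i for _, i in sorted(scores, reverse=True)[:k]]

def is_uncertain(predict, segments, query_idx, relevant_idx):
    """Step 2: flag uncertainty when predictions disagree across relevant scenes."""
    labels = {predict(segments[i]) for i in [query_idx, *relevant_idx]}
    return len(labels) > 1  # inconsistent labels across scenes -> uncertain

def resolve_with_cooperation(predictors, segment):
    """Step 3: collective intelligence as a majority vote over several VLMs."""
    votes = [predict(segment) for predict in predictors]
    return max(set(votes), key=votes.count)
```

Here `predict` stands in for a prompted VLM call returning a discrete label such as "normal" or "anomalous"; the relevance blend and majority vote are placeholder choices, since the abstract does not specify how relevance is scored or how the cooperating VLMs aggregate their outputs.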
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5306