Keywords: Video Anomaly Detection, Vision-Language Models, Dual-Encoder Architecture, Parameter-Efficient Fine-Tuning
TL;DR: Dual-encoder fusion of CLIP with an anomaly-trained ViT surpasses the base VLM on Top-2/Top-3 surveillance anomaly classification, while a Top-1 and caption-level calibration gap identifies the fusion-to-generation interface as the bottleneck.
Abstract: Vision-Language Models (VLMs) have recently been explored for Video Anomaly Detection (VAD) to provide natural-language explanations of anomalous events. In this work, we investigate a dual-encoder architecture that pairs a general-purpose CLIP vision encoder with a Vision Transformer (ViT) trained under Multiple Instance Learning (MIL) to inject anomaly-specific knowledge at the visual encoding stage, while keeping the Large Language Model (LLM) frozen and applying only LoRA-based fine-tuning to keep training costs low. A closed-set classification probe provides direct evidence that this design succeeds at the representation level: the dual-encoder variant surpasses the base VLM on Top-2 (0.534 vs. 0.525) and Top-3 (0.636 vs. 0.613) anomaly classification, showing that MIL-ViT injection contributes additional discriminative signal beyond CLIP alone. At Top-1 and at the caption level, the picture is more constrained: the model partially recovers the Sentence-BERT (SBERT) drop caused by LoRA fine-tuning yet still tends toward stereotyped templates, and trails the base model on Top-1 classification. We characterize this pattern as a calibration gap, rather than a loss of class information: the correct anomaly class is reliably available among the top candidates but is not consistently surfaced as the most probable token.
These results identify the fusion-to-generation interface as the primary bottleneck, and we discuss two concrete paths forward (explicit alignment pretraining and late-fusion strategies) that may close this gap.
Submission Number: 63