VAR Based on Knowledge Transfer from VLMs and Video Descriptions

Emilio Vera-Cordero, David Mata-Mendoza, Gibran Benitez-Garcia, Hiroki Takahashi, Mariko Nakano

Published: 01 Jan 2026, Last Modified: 08 Jun 2026 · Crossref · CC BY-SA 4.0

Abstract: Video Anomaly Recognition (VAR) is crucial for public safety, yet it remains challenging due to the coarse labels in existing datasets that hinder the learning of rich patterns. Unlike standard detection, VAR requires precise classification of complex events. In this paper, we address instance-level VAR in temporally localized events using a pre-trained Vision-Language Model (VLM). We propose a classification framework that transfers visual and textual knowledge by constructing a classification matrix based on centroids of offline description embeddings. This matrix, constructed with the VLM text encoder, is used to fine-tune the VLM visual encoder for more accurate anomaly categorization. Our video classification experiments on the UCF-Crime dataset demonstrate significant improvements over video-level training and baselines. The code is publicly available at https://github.com/jemveco/clip4var.
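The centroid-based classification matrix described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes the description embeddings have already been computed offline by the VLM text encoder and are stood in for here by random vectors. The function names, the embedding dimension, and the `temperature` value are all hypothetical. The fine-tuning of the visual encoder is omitted; only the construction of the matrix and its use as a fixed classifier over video embeddings are shown.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along the given axis."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def build_classification_matrix(desc_embeddings, labels, num_classes):
    """Average the description embeddings of each class into one centroid.

    desc_embeddings: (N, D) array of text embeddings, one per description.
    labels: (N,) integer class index of each description.
    Assumes every class has at least one description.
    Returns a (num_classes, D) matrix of unit-norm centroids.
    """
    desc_embeddings = l2_normalize(desc_embeddings)
    centroids = np.stack(
        [desc_embeddings[labels == c].mean(axis=0) for c in range(num_classes)]
    )
    return l2_normalize(centroids)


def classify(video_embeddings, class_matrix, temperature=0.01):
    """Score video embeddings against the centroids by cosine similarity.

    temperature is an assumed scaling factor, not a value from the paper.
    Returns the (B, num_classes) logits and the (B,) predicted class indices.
    """
    logits = l2_normalize(video_embeddings) @ class_matrix.T / temperature
    return logits, logits.argmax(axis=1)


# Synthetic stand-ins: 3 classes, 5 descriptions per class, 8-dim embeddings.
rng = np.random.default_rng(0)
num_classes, per_class, dim = 3, 5, 8
labels = np.repeat(np.arange(num_classes), per_class)
desc_embeddings = rng.normal(size=(num_classes * per_class, dim))

class_matrix = build_classification_matrix(desc_embeddings, labels, num_classes)

# Sanity check: a "video" embedding equal to a centroid maps to its own class.
logits, predictions = classify(class_matrix.copy(), class_matrix)
```

In a training setup along the lines the abstract describes, logits of this kind would presumably feed a cross-entropy loss whose gradients update only the visual encoder while the matrix stays fixed. That detail is an assumption here, not something the abstract states.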

External IDs: doi:10.1007/978-3-032-28393-1_19