Keywords: video anomaly detection, multi-modal large language models, zero-shot, real-time
TL;DR: Flashback is a zero-shot, real-time, and explainable VAD system that retrieves from an offline caption memory with lightweight bias controls and runtime encoder selection.
Abstract: Video anomaly detection (VAD) aims to identify unusual events in continuous video streams, yet most existing systems either rely on domain-specific retraining or fail to meet strict real-time demands. We present **Flashback**, a zero-shot and real-time paradigm that reframes VAD as retrieval over an offline pseudo-scene memory. Inspired by how humans recall past experiences to judge the present, Flashback constructs a large set of normal and anomalous captions entirely offline with a language model, embeds them once with a frozen video-text encoder, and reuses this memory online. At inference, each segment is matched against the memory to produce both an anomaly score and a textual rationale, eliminating all online LLM calls and sustaining per-segment deadlines. Three lightweight controls improve robustness: _repulsive prompting_ separates normal and anomalous caption spaces, _scaled anomaly penalization_ corrects residual anomaly bias, and _certainty-driven runtime encoder selection_ maintains weakly-hard real-time guarantees by allocating extra compute only to difficult segments. On UCF-Crime and XD-Violence, Flashback achieves 87.7 AUC and 75.0 AP, outperforming prior zero-shot methods while providing human-readable explanations at up to 43.8 fps on a single consumer GPU. The result is the first VAD system that is simultaneously zero-shot, real-time, and explainable.
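The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `anomaly_scale` and `top_k` parameters, the softmax weighting, and the toy two-dimensional embeddings are all assumptions made for illustration. A real system would use embeddings from a frozen video-text encoder. The sketch only shows the general idea of matching a segment embedding against a precomputed caption memory, down-weighting anomalous matches, and returning both a score and the best-matching caption as a rationale.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def score_segment(segment_emb, memory, anomaly_scale=0.9, top_k=2):
    """Score one segment against an offline caption memory.

    memory: list of (caption, embedding, is_anomalous) tuples, embedded once
    offline. anomaly_scale is a hypothetical stand-in for the abstract's
    "scaled anomaly penalization"; the exact form used in the paper is not
    given in the abstract.
    """
    scored = []
    for caption, emb, is_anomalous in memory:
        sim = cosine(segment_emb, emb)
        if is_anomalous:
            # Assumed penalization: shrink similarities to anomalous captions.
            # This toy form presumes non-negative similarities.
            sim *= anomaly_scale
        scored.append((sim, caption, is_anomalous))

    # Keep the top-k most similar captions.
    scored.sort(key=lambda item: item[0], reverse=True)
    top = scored[:top_k]

    # Anomaly score: softmax-weighted share of anomalous captions in the top-k.
    weights = [math.exp(sim) for sim, _, _ in top]
    anomalous_mass = sum(w for w, (_, _, is_anom) in zip(weights, top) if is_anom)
    anomaly_score = anomalous_mass / sum(weights)

    # The best-matching caption doubles as the textual rationale.
    rationale = top[0][1]
    return anomaly_score, rationale


# Toy memory with hand-made 2-D embeddings (illustrative only).
memory = [
    ("a person walks along a sidewalk", [1.0, 0.0], False),
    ("a person smashes a car window", [0.0, 1.0], True),
]

normal_score, normal_reason = score_segment([0.9, 0.1], memory)
anomalous_score, anomalous_reason = score_segment([0.1, 0.9], memory)
```

Because the memory is embedded once offline, the online cost in this sketch is one similarity pass per segment, which is consistent with the abstract's claim that no online LLM calls are needed.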
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 10801