Towards Secure Video Surveillance: A Few-Shot Spatiotemporal Perception Transformer for Unseen Behavioral Anomalies

Published: 2025, Last Modified: 04 Jan 2026, AVSS 2025, CC BY-SA 4.0
Abstract: Ensuring security in surveillance systems requires accurate detection and classification of unseen behavioral anomalies with minimal labeled data. We propose the Few-Shot Spatiotemporal Perception Transformer (FewShot-SPT), a novel framework that achieves this through three key innovations: (1) Event-Guided Keyframe Extraction (EGKE) dynamically selects keyframes based on anomaly intensity, reducing redundancy and boosting accuracy by 7–8%; (2) Adaptive Modality Gating (AMG) with Perceiver IO attention enables efficient multimodal fusion across video, audio, and text; and (3) Adaptive Prototypical Few-Shot Learning with contrastive learning improves generalization to unseen anomalies. Unlike prior methods that require scene-specific fine-tuning, FewShot-SPT generalizes dynamically using anomaly-aware scoring and refined prototypes. It achieves 91.6% AUC (2-way 5-shot) and 76.3% accuracy (5-way 5-shot) on UCF-Crime, and 84.2% on XD-Violence, outperforming state-of-the-art baselines. Real-world park surveillance experiments demonstrate FewShot-SPT’s robustness in detecting critical incidents such as falls, weapons, and intrusions in real time.
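For readers unfamiliar with the N-way K-shot setting mentioned in the abstract, the following is a minimal sketch of standard prototypical classification: each class prototype is the mean of its support embeddings, and a query is assigned to the nearest prototype. It is not the paper's adaptive variant (no contrastive learning, anomaly-aware scoring, or prototype refinement). The function names, the class labels, and the 2-D toy embeddings are all hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def build_prototypes(support):
    """Average the support embeddings of each class.

    support: list of (embedding, label) pairs, where embedding is a list of floats.
    Returns a dict mapping label -> mean embedding (the class prototype).
    """
    grouped = defaultdict(list)
    for embedding, label in support:
        grouped[label].append(embedding)
    return {
        label: [sum(dim) / len(embeddings) for dim in zip(*embeddings)]
        for label, embeddings in grouped.items()
    }


def classify(query, prototypes):
    """Assign the query to the nearest prototype.

    Returns (predicted_label, probabilities), where the probabilities are a
    softmax over negative squared Euclidean distances to each prototype.
    """
    neg_dist = {
        label: -sum((q - p) ** 2 for q, p in zip(query, proto))
        for label, proto in prototypes.items()
    }
    shift = max(neg_dist.values())  # subtract the max for numerical stability
    exp = {label: math.exp(v - shift) for label, v in neg_dist.items()}
    total = sum(exp.values())
    probs = {label: e / total for label, e in exp.items()}
    return max(probs, key=probs.get), probs


# Toy 2-way 2-shot episode with hypothetical 2-D embeddings.
support = [
    ([0.0, 0.0], "normal"), ([0.2, 0.0], "normal"),
    ([1.0, 1.0], "fall"), ([1.2, 1.0], "fall"),
]
prototypes = build_prototypes(support)
label, probs = classify([0.9, 0.8], prototypes)
print(label, probs)
```

In a real system the embeddings would come from a learned encoder rather than being hand-written; the sketch only shows how a handful of labeled examples per class can define a classifier without any fine-tuning.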