Attribute-Centric Representation Learning for Interpretable Crime Scene Analysis in Video Anomaly Detection

ICLR 2026 Conference Submission 25128 Authors

20 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Crime Scene Analysis, Video Anomaly Detection, Explainable AI, Visual Language Reasoning
TL;DR: The paper proposes an attribute-centric framework for crime scene analysis in video anomaly detection by augmenting an existing crime dataset with attribute-level annotations and attribute-enriched captions created using large language models.
Abstract: Automatic crime scene analysis is an important application area for representation learning in Video Anomaly Detection (VAD). Effective interpretation of anomalous events requires models to learn rich, disentangled representations that capture fine-grained, crime-relevant attributes. However, widely used VAD datasets (e.g., UCA, CUVA) primarily offer coarse event-level labels and lack the attribute-level supervision needed to model crime-specific behaviors. To bridge this gap, we propose an attribute-centric learning framework that explicitly conditions video representations on crime-causing attributes. We extend the UCA dataset with over 1.5M new attribute-centric annotations generated by LLMs using carefully designed prompts. These annotations enable supervised fine-tuning of a curated CLIP-based model, yielding more discriminative, attribute-aware video representations and more precise event captions. An LLM-based summarizer then distills these captions into context-rich explanations, facilitating interpretable scene understanding. Our approach answers three core questions in crime scene analysis: \textbf{What? When? How?} Extensive experiments show that the proposed representation learning framework yields significant improvements over the baselines in attribute-centric crime classification accuracy ($\approx 20\%\uparrow$) and in MMEval scores ($\approx 6.4\%\uparrow$). We further analyze and mitigate biases in MMEval to ensure robust and fair evaluation. These results highlight the importance of attribute-conditioned representation learning for interpretable and reliable VAD.
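To make the fine-tuning step concrete, below is a minimal sketch (not the authors' released code) of attribute-conditioned contrastive fine-tuning with a CLIP backbone, where video frames are paired with attribute-enriched captions. The checkpoint name, example caption, and hyperparameters are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of attribute-conditioned CLIP fine-tuning (assumptions:
# HuggingFace CLIP checkpoint, placeholder captions and hyperparameters).
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def training_step(frames, attribute_captions):
    """One contrastive step on a batch of (frame, attribute-caption) pairs.

    frames: list of PIL images sampled from annotated video segments.
    attribute_captions: captions enriched with crime-relevant attributes,
        e.g. "a person forcing open a car door at night" (hypothetical).
    """
    inputs = processor(text=attribute_captions, images=frames,
                       return_tensors="pt", padding=True, truncation=True)
    outputs = model(**inputs)
    logits = outputs.logits_per_image           # (batch, batch) similarities
    targets = torch.arange(logits.size(0))      # matched pairs on the diagonal
    # Symmetric InfoNCE loss: image-to-text plus text-to-image cross-entropy.
    loss = (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

The attribute conditioning here comes entirely from the caption side: because each caption encodes crime-relevant attributes, the contrastive objective pushes frame embeddings to separate along those attributes rather than only along coarse event labels.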
Primary Area: interpretability and explainable AI
Submission Number: 25128