Highlights

• In this paper, we propose a weakly supervised paradigm of cross-modal detection and consistency learning, leveraging dual consistency to provide discriminative representations of anomalies at both the semantic-to-target and target-to-snippet levels.
• Specifically, we introduce a cross-modal detection network, which detects targets in each frame according to given semantic rules, to derive semantic-consistent visual embeddings.
• To draw a clear boundary between anomalies and normal events, a cross-domain alignment module is proposed to enhance the discriminative representation of abnormal targets by learning the contextual consistency between target and snippet embeddings.
• Our architecture integrates the detection of semantic-consistent targets under variable semantic rules, ensuring transferable deployment across scenarios and enabling comprehensive identification, localization, and recognition of abnormal events through a “when-where-which” pipeline.
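The target-to-snippet contextual consistency described above could, in principle, be trained with a contrastive alignment objective that pulls each target embedding toward the embedding of the snippet it occurs in. The paper does not give the exact loss here; the sketch below is a generic InfoNCE-style alignment loss under that assumption, with all names (`alignment_loss`, `temperature`) hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def alignment_loss(target_emb, snippet_emb, temperature=0.1):
    """InfoNCE-style sketch: the i-th target should match the i-th snippet.

    target_emb, snippet_emb: (N, D) arrays of paired embeddings.
    Returns the mean negative log-probability of the correct pairing.
    """
    t = l2_normalize(np.asarray(target_emb, dtype=float))
    s = l2_normalize(np.asarray(snippet_emb, dtype=float))
    logits = t @ s.T / temperature                  # (N, N) similarity matrix
    # Row-wise log-softmax; the diagonal holds the matched target-snippet pairs.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(t))
    return -log_probs[idx, idx].mean()
```

With this objective, correctly paired target and snippet embeddings yield a lower loss than mismatched ones, which is the behavior a contextual-consistency module would rely on.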