Keywords: Harmful Video Detection, Multimodal Dataset, Multimodal Reasoning
Abstract: Short videos have become a dominant online medium, where harmful content is increasingly conveyed in implicit and fragmented forms. Prior work largely focuses on explicit categories (e.g., violence, hate speech) and often fails to detect videos that subtly promote misleading values through narrative context, emotional framing, or cross-modal cues, partly due to the lack of dedicated benchmarks for implicit harm. To bridge this gap, we construct DeepHarm-7K, a large-scale harmful short video dataset comprising 7,110 samples with annotations guided by a fine-grained harmful content taxonomy, which systematically incorporates implicitly harmful videos across diverse real-world scenarios under multi-dimensional quality control. Building on this dataset, we propose DeepHarm-VL, a multimodal detection framework that integrates visual, audio, and cross-modal reasoning. The framework employs a two-round reasoning strategy to capture implicit semantics without task-specific fine-tuning, while remaining compatible with closed-source multimodal models. Experimental results show consistent improvements over strong baselines and state-of-the-art methods, demonstrating effectiveness in detecting both explicit and implicit harmful short videos.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: video processing, cross-modal information extraction, cross-modal content generation, multimodality
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: Chinese, English
Submission Number: 2231