Abstract: The rapid growth of video platforms has transformed information dissemination and led to an explosion of multimedia content. However, this reach also introduces risks: some users exploit these platforms to spread hate speech, often concealed through complex rhetoric, which makes hateful video detection a critical challenge.
Existing detection methods rely heavily on unimodal analysis or simple feature fusion, and they struggle to capture cross-modal interactions and to reason about implicit hate conveyed through sarcasm and metaphor. To address these limitations, we propose HVGuard, the first reasoning-based hateful video detection framework with multimodal large language models (MLLMs). Our approach integrates Chain-of-Thought (CoT) reasoning to enhance multimodal interaction modeling and implicit hate interpretation. Additionally, we design a Mixture-of-Experts (MoE) network for efficient multimodal fusion and final decision-making. The framework is modular and extensible, allowing flexible integration of different MLLMs and encoders. Experimental results demonstrate that HVGuard outperforms all existing advanced detection tools, achieving an improvement of 6.88% to 13.13% in accuracy and 9.21% to 34.37% in M-F1 on two public datasets covering both English and Chinese.
Paper Type: Long
Research Area: Computational Social Science and Cultural Analytics
Research Area Keywords: hate-speech detection, NLP tools for social analysis
Contribution Types: Publicly available software and/or pre-trained models
Languages Studied: English, Chinese
Submission Number: 6574