Abstract: Hate speech is a pressing issue in modern society, with significant repercussions both online and offline. Recent research on hate speech detection has primarily centered on text-based media, largely overlooking multimodal content such as videos. Existing hateful video datasets have predominantly focused on English content within a Western context and have been limited to binary labels (hateful or non-hateful), lacking detailed contextual information. This study presents $\textsf{MultiHateClip}$, a novel multilingual dataset curated through hate lexicons and human annotation. It aims to enhance the detection of hateful videos on platforms such as YouTube and Bilibili, encompassing content in both English and Chinese. Comprising 2,000 videos annotated for hatefulness, offensiveness, and normalcy, the dataset provides a cross-cultural perspective on gender-based hate speech. Through a detailed examination of the human annotation results, we discuss the differences between Chinese and English hateful videos and underscore the importance of different modalities in hateful and offensive video analysis. Evaluations of state-of-the-art video classification models, such as $\textit{VLM}$ and $\textit{GPT-4V}$, on $\textsf{MultiHateClip}$ highlight the existing challenges in accurately distinguishing between hateful and offensive content, as well as the urgent need for models that are both multimodally and culturally nuanced. $\textsf{MultiHateClip}$ serves as a foundational step towards developing more effective hateful video detection solutions, emphasizing the importance of a multimodal and culturally sensitive approach in the ongoing fight against online hate speech.
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Media Interpretation
Relevance To Conference: We introduced MultiHateClip, the first multilingual dataset categorizing videos as hateful, offensive, or normal, with detailed annotations specifying hate speech segments, targeted victims, and contributing modalities. This dataset serves as a vital resource for further research. Our analysis of MultiHateClip revealed culture-specific traits of English- and Chinese-language hateful videos and emphasized the importance of multimodal inputs in detecting hate speech, informing the refinement of detection approaches and model development. We assessed current video classification models, identifying weaknesses in distinguishing between hateful and offensive content, handling non-Western cultural data, and integrating multimodal representations. These findings highlight gaps and suggest directions for future research on improving hateful video detection methodologies.
Submission Number: 4482