Abstract: Multimedia content on social media is rapidly evolving, with memes gaining prominence as a distinctive form. Unfortunately, some malicious users exploit memes to target individuals or vulnerable communities, making it imperative to identify and address such instances of hateful memes. Extensive research has been conducted to address this issue by developing hate meme detection models. However, a notable limitation of traditional machine/deep learning models is their requirement for high-quality labeled datasets to achieve accurate detection. Recently, the research community has witnessed the emergence of several vision-language models (VLMs) that exhibit outstanding performance across a variety of tasks. In this study, we investigate the efficacy of open-source VLMs on intricate tasks such as hate meme detection in a completely zero-shot setting. In particular, we systematically study various prompt strategies that leverage the zero-shot capabilities of VLMs to detect hateful/harmful memes. Next, we use a novel superpixel-based occlusion technique to better interpret the misclassification results. Finally, we show that these misclassified data points cluster into well-defined topics, naturally identifying the vulnerabilities of the VLMs and paving the way for the design of better safety guardrails in the future.
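
A minimal sketch of the zero-shot prompting setup described in the abstract might look as follows; the model checkpoint, prompt wording, and answer parsing here are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch: zero-shot hate-meme classification with an open-source VLM.
# Checkpoint and prompt are hypothetical choices, not the authors' setup.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # one plausible open-source VLM

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def classify_meme(image_path: str) -> str:
    """Ask the VLM, zero-shot, whether a meme is hateful."""
    image = Image.open(image_path).convert("RGB")
    # One of many possible prompt strategies; the paper compares several.
    prompt = (
        "USER: <image>\nIs this meme hateful or harmful towards any person "
        "or community? Answer with exactly one word: Yes or No.\nASSISTANT:"
    )
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(
        model.device, torch.float16
    )
    output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)
    answer = processor.decode(output_ids[0], skip_special_tokens=True)
    # Map the free-form generation back to a binary label.
    verdict = answer.split("ASSISTANT:")[-1].strip().lower()
    return "hateful" if verdict.startswith("yes") else "not hateful"
```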
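
The superpixel-based occlusion probe can be sketched as below, assuming a scoring function `score_fn(image) -> float` that returns the model's confidence that the meme is hateful (e.g., the probability of a "Yes" answer); the SLIC parameters and gray fill value are illustrative assumptions.

```python
# Hedged sketch: occlude one superpixel at a time and measure how much
# the model's "hateful" score drops, yielding a region-level importance map.
import numpy as np
from skimage.segmentation import slic

def superpixel_occlusion_map(image: np.ndarray, score_fn, n_segments: int = 50):
    """Return per-pixel importance scores, constant within each superpixel."""
    # Partition the meme into visually coherent regions.
    segments = slic(image, n_segments=n_segments, compactness=10, start_label=0)
    base_score = score_fn(image)
    importance = np.zeros(image.shape[:2], dtype=float)
    for seg_id in np.unique(segments):
        occluded = image.copy()
        occluded[segments == seg_id] = 127  # blank out this region with gray
        # Importance = drop in the hateful score when the region is hidden.
        importance[segments == seg_id] = base_score - score_fn(occluded)
    return importance
```

Regions whose occlusion causes the largest score drop are the ones the VLM relied on, which is what makes this probe useful for diagnosing misclassifications.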
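
The clustering of misclassified data points into topics could, for instance, be run over the extracted text of the misclassified memes with an off-the-shelf topic model; BERTopic is one plausible choice here, not necessarily the authors' pipeline.

```python
# Hedged sketch: group the text of misclassified memes into interpretable
# topics to surface systematic failure modes of the VLM.
from bertopic import BERTopic

def cluster_misclassifications(texts: list[str]):
    """Fit a topic model over misclassified-meme texts and return topic info."""
    topic_model = BERTopic(min_topic_size=10)  # hypothetical hyperparameter
    topics, _ = topic_model.fit_transform(texts)
    # get_topic_info() summarizes each recovered topic with its top terms,
    # e.g., recurring targets or themes the model consistently fails on.
    return topic_model.get_topic_info(), topics
```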
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, hate-meme detection, interpretability, topic modeling, prompting
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 2202