Abstract: Cyberbullying has evolved rapidly alongside online platforms, transcending traditional text-based forms to include images and other multimedia content. Detecting cyberbullying images poses two major challenges: recognizing cyberbullying-related visual factors and addressing the context-dependent nature of such images. In this paper, we conduct a comprehensive investigation of the ability of Large Vision-Language Models (LVLMs) to evaluate visual factors related to cyberbullying and to interpret the context-dependent nature of such images. Furthermore, by proposing a diverse set of prompting strategies, we optimize LVLMs for cyberbullying image detection. In particular, through our carefully crafted Chain-of-Thought (CoT) methodology, we guide the model through structured reasoning pathways to interpret complex visual factors and account for their context. Our results show that these structured reasoning pathways significantly enhance model performance, achieving state-of-the-art accuracy and precision while remaining efficient by eliminating the need for any extensive training process.