Abstract: Introduction. Cyberbullying, a form of abusive online behavior, lacks a precise definition but is generally understood as a repetitive process: a sequence of harassing messages sent from a bully to a victim over a period of time with the intent to harm the victim. Numerous data-driven approaches have been developed for the automatic classification of cyberbullying instances, with emphasis on classification accuracy. While highly accurate classifiers are undoubtedly important, existing cyberbullying detection methods have two key pitfalls: (i) they disregard the repetitive nature of the harassing process, and (ii) they work retrospectively (i.e., after a cyberbullying incident has occurred), making it difficult to intervene before an interaction escalates. Motivated by the scarcity of methods to anticipate cyberbullying, we focus on cyberbullying prediction with the goal of reducing the time from detection to intervention.
Methods. We formulate the prediction of the number of harassing comments a media session will receive over a period of time as a regularized multi-task regression problem. We consider two settings: (i) the progression of cyberbullying behavior is modeled from a time point in the near future to subsequent time points further into the future, given limited knowledge of the recent past, and (ii) progressively more historical data is accumulated to improve prediction accuracy. To validate our approach, we conduct an extensive experimental evaluation on a real-world dataset from Instagram, the online social media platform with the highest percentage of users reporting experiencing cyberbullying.
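As a rough illustration of the formulation above, the following sketch treats each future time step as one regression task and couples the tasks through a shared penalty. The abstract does not specify the regularizer or the solver, so the L2,1 (row-sparsity) penalty, the proximal-gradient solver, the helper names `make_windows` and `fit_multitask_l21`, and the synthetic Poisson counts are all assumptions made for illustration, not the authors' method or data.

```python
import numpy as np


def make_windows(counts, n_history, n_horizons):
    """Split per-session count series into inputs and multi-task targets.

    counts: array of shape (n_sessions, n_steps) holding the number of
    harassing comments per time step (hypothetical layout).
    X holds the first n_history steps; Y holds the next n_horizons steps,
    one regression task per prediction horizon.
    """
    counts = np.asarray(counts, dtype=float)
    X = counts[:, :n_history]
    Y = counts[:, n_history:n_history + n_horizons]
    return X, Y


def fit_multitask_l21(X, Y, lam=0.1, n_iter=500):
    """Minimise 0.5/n * ||XW - Y||_F^2 + lam * sum_j ||W[j, :]||_2.

    Solved by proximal gradient descent. The L2,1 penalty is an assumed
    choice: it shrinks whole rows of W, so a history feature is kept or
    dropped jointly across all horizons.
    """
    n, d = X.shape
    W = np.zeros((d, Y.shape[1]))
    # Step size 1/L, where L is the Lipschitz constant of the smooth part.
    step = n / (np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        grad = X.T @ (X @ W - Y) / n
        V = W - step * grad
        # Row-wise soft-thresholding: proximal operator of the L2,1 norm.
        norms = np.linalg.norm(V, axis=1, keepdims=True)
        W = V * np.maximum(0.0, 1.0 - step * lam / np.maximum(norms, 1e-12))
    return W


# Synthetic stand-in data (not the Instagram dataset used in the paper).
rng = np.random.default_rng(0)
counts = rng.poisson(3.0, size=(200, 8))
X, Y = make_windows(counts, n_history=5, n_horizons=3)
W = fit_multitask_l21(X, Y, lam=0.1)
predictions = X @ W  # one column of predicted counts per horizon
```

Increasing `n_history` corresponds to setting (ii), where more observed comments are accumulated before predicting; increasing `n_horizons` extends the forecast further into the future, as in setting (i).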
Results. Intuitively, the larger the number of observed comments in the recent past of a media session, the better the predictive power of our approach. The downside of using more historical data is that decisions must be postponed until more comments are collected. We therefore examine the trade-off between accuracy and decision speed. Overall, our approach outperforms competing approaches by up to 31.4% in Recall and 46.2% in Matthews correlation coefficient.
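For reference, the two reported metrics can be computed from binary labels as in the sketch below. It uses the standard definitions of Recall and the Matthews correlation coefficient; how predicted comment counts are turned into binary decisions is not described in the abstract, so the example starts from labels that are already binary, and the function names are illustrative.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def recall(tp, fn):
    """Fraction of actual positives that were flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels for illustration only.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
toy_recall = recall(tp, fn)
toy_mcc = matthews_corrcoef(tp, tn, fp, fn)
```

Unlike Recall, the Matthews correlation coefficient uses all four cells of the confusion matrix, which makes it informative when bullying sessions are much rarer than benign ones.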
Discussion. Our approach can be used to prioritize media sessions effectively, either for increased monitoring over time or for immediate intervention before a conversation escalates. In future work, we plan to incorporate additional features and investigate how well our approach generalizes to other major social networking venues where users frequently become victims of cyberbullying. Beyond cyberbullying prediction, our work is, to the best of our knowledge, the first to provide insights into the forecasting performance of multi-task regression as a function of the prediction horizon and the length of available historical data. We thus believe that our work can serve as a reference point on the forecasting performance of multi-task regression for both researchers and practitioners.