Offensive Content Detection Via Synthetic Code-Switched Text

Anonymous

16 Jan 2022 (modified: 05 May 2023), ACL ARR 2022 January Blind Submission, Readers: Everyone
Abstract: The prevalence of offensive content on social media has become a serious concern for online platforms such as customer-service chat boxes and social media sites. Classifying offensive and hate-speech content in online settings is an essential task in many applications. However, text from online platforms can contain code-switching, the mixing of more than one language within a text. The lack of labeled code-switched data for low-resource language combinations makes this problem harder. To address this, we release a synthetic code-switched text dataset of around 29k samples for training and a real-world dataset of around 10k samples for testing, covering three language combinations: en-fr, en-es, and en-de. In this paper, we describe our algorithm for creating synthetic code-switched offensive content data and the process for creating the human-generated data. We also report the results of a keyword classification baseline and a multilingual transformer-based classification model.
Paper Type: short