Abstract: Offensive Speech Detection (OSD) has been a prominent research topic in NLP. However, the development of Chinese OSD is constrained by the lack of sufficient benchmark datasets. Moreover, Chinese OSD faces challenges such as ambiguity, context dependence, and, in particular, the identification of Implicit Offensive Speech. To address these challenges, we introduce a fine-grained labeling system covering 10 categories of implicit offensive speech, grounded in linguistic principles, and present SinOffen, a comprehensive real-world Chinese offensive speech dataset constructed based on this system. We evaluate the performance of mainstream pre-trained language models (PLMs) and generative large language models (LLMs) on this task, and investigate the impact of different prompt templates on model performance. Our work highlights the urgent need for more refined detection methods that can handle implicit offensive speech in Chinese, in order to counter evolving evasion strategies.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: NLP datasets, benchmarking
Contribution Types: Data resources, Data analysis
Languages Studied: Chinese, English
Submission Number: 288