Abstract: Offensive Speech Detection (OSD) has been a prominent research topic in NLP. However, the development of Chinese OSD is constrained by the lack of sufficient benchmark datasets. Moreover, Chinese OSD faces challenges such as ambiguity, context dependence, and, in particular, the identification of Implicit Offensive Speech. To address these challenges, we introduce a fine-grained labeling system covering 10 categories of implicit offensive speech, grounded in linguistic principles, and present SinOffen, a comprehensive real-world Chinese offensive speech dataset constructed based on this system. We evaluate the performance of mainstream pre-trained language models (PLMs) and generative large language models (LLMs) on this task, and investigate the impact of different prompt templates on model performance. Our work highlights the urgent need for more refined detection methods that can handle implicit offensive speech in Chinese, in order to counter evolving evasion strategies.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: NLP datasets, benchmarking
Contribution Types: Data resources, Data analysis
Languages Studied: Chinese, English
Submission Number: 288