Abstract: We present \textbf{CangjieToxi}, a novel benchmark for detecting covert offensive language in Chinese social media. The dataset incorporates two real-world evasion strategies—\textit{character splitting} and \textit{radical substitution}—which obfuscate toxic expressions by altering the visual or structural properties of Chinese characters. These perturbations pose significant challenges for existing detection systems. To address this challenge, we propose a \textbf{multi-stage prompting framework} that decouples character anomaly detection, semantic restoration, and toxicity classification, thereby enhancing robustness under adversarial conditions. Experiments with state-of-the-art large language models demonstrate that our method significantly outperforms baselines in both accuracy and false-positive control. Our work offers a new testbed and a practical mitigation strategy for building resilient toxicity detection systems.
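The abstract describes a pipeline that decouples three stages: character anomaly detection, semantic restoration, and toxicity classification. The sketch below is a minimal illustration of that decoupling, not the authors' implementation; the prompt wording, function names, and the `query_llm` callable are all hypothetical stand-ins for whichever LLM interface is actually used.

```python
# Hypothetical sketch of the three-stage prompting pipeline from the abstract.
# `query_llm` is an assumed caller-supplied function that sends a prompt to an
# LLM and returns its text response; all prompt wording here is illustrative.
from typing import Callable


def detect_anomalies(text: str, query_llm: Callable[[str], str]) -> str:
    """Stage 1: flag characters that look split or radical-substituted."""
    prompt = (
        "List any Chinese characters in the following text that appear to be "
        "split into components or written with substituted radicals:\n" + text
    )
    return query_llm(prompt)


def restore_semantics(text: str, anomalies: str,
                      query_llm: Callable[[str], str]) -> str:
    """Stage 2: rewrite the text with the flagged characters restored."""
    prompt = (
        f"These characters appear obfuscated: {anomalies}\n"
        f"Rewrite the text with the original characters restored:\n{text}"
    )
    return query_llm(prompt)


def classify_toxicity(restored: str, query_llm: Callable[[str], str]) -> str:
    """Stage 3: classify the restored text as toxic or non-toxic."""
    prompt = (
        "Answer 'toxic' or 'non-toxic' for the following text:\n" + restored
    )
    return query_llm(prompt)


def multi_stage_pipeline(text: str, query_llm: Callable[[str], str]) -> str:
    """Run the three decoupled stages in sequence on one input."""
    anomalies = detect_anomalies(text, query_llm)
    restored = restore_semantics(text, anomalies, query_llm)
    return classify_toxicity(restored, query_llm)
```

Keeping each stage as a separate prompt (rather than one monolithic instruction) is what the abstract means by decoupling: the restoration step can correct obfuscated characters before the classifier sees them, which is plausibly where the robustness gain comes from.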
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Safety and alignment, benchmarking
Contribution Types: Data resources
Languages Studied: Chinese
Submission Number: 2631