Abstract: We present \textbf{CangjieToxi}, a novel benchmark for detecting covert offensive language in Chinese social media. The dataset incorporates two real-world evasion strategies—\textit{character splitting} and \textit{radical substitution}—which obfuscate toxic expressions by altering the visual or structural properties of Chinese characters. These perturbations pose significant challenges for existing detection systems. To address this challenge, we propose a \textbf{multi-stage prompting framework} that decouples character anomaly detection, semantic restoration, and toxicity classification, thereby enhancing robustness under adversarial conditions. Experiments with state-of-the-art large language models demonstrate that our method significantly outperforms baselines in both accuracy and false-positive control. Our work offers a new testbed and a practical mitigation strategy for building resilient toxicity detection systems.
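The abstract describes a pipeline that decouples three stages: character anomaly detection, semantic restoration, and toxicity classification. The sketch below is a minimal illustration of that decoupling, not the authors' implementation; the prompt wording, function names, and the `query_llm` callable are all hypothetical stand-ins for whichever LLM interface is actually used.

```python
# Hypothetical sketch of the three-stage prompting pipeline from the abstract.
# `query_llm` is an assumed caller-supplied function that sends a prompt to an
# LLM and returns its text response; all prompt wording here is illustrative.
from typing import Callable


def detect_anomalies(text: str, query_llm: Callable[[str], str]) -> str:
    """Stage 1: flag characters that look split or radical-substituted."""
    prompt = (
        "List any Chinese characters in the following text that appear to be "
        "split into components or written with substituted radicals:\n" + text
    )
    return query_llm(prompt)


def restore_semantics(text: str, anomalies: str,
                      query_llm: Callable[[str], str]) -> str:
    """Stage 2: rewrite the text with the flagged characters restored."""
    prompt = (
        f"These characters appear obfuscated: {anomalies}\n"
        f"Rewrite the text with the original characters restored:\n{text}"
    )
    return query_llm(prompt)


def classify_toxicity(restored: str, query_llm: Callable[[str], str]) -> str:
    """Stage 3: classify the restored text as toxic or non-toxic."""
    prompt = (
        "Answer 'toxic' or 'non-toxic' for the following text:\n" + restored
    )
    return query_llm(prompt)


def multi_stage_pipeline(text: str, query_llm: Callable[[str], str]) -> str:
    """Run the three decoupled stages in sequence on one input."""
    anomalies = detect_anomalies(text, query_llm)
    restored = restore_semantics(text, anomalies, query_llm)
    return classify_toxicity(restored, query_llm)
```

Keeping each stage as a separate prompt (rather than one monolithic instruction) is what the abstract means by decoupling: the restoration step can correct obfuscated characters before the classifier sees them, which is plausibly where the robustness gain comes from.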
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Safety and alignment, benchmarking
Contribution Types: Data resources
Languages Studied: Chinese
Submission Number: 2631