Chinese legal adversarial text generation based on interpretable perturbation strategies

Yunting Zhang, Lin Ye, Shang Li, Yue Wu, Baisong Li, Hongli Zhang

Published: 01 Jan 2025, Last Modified: 17 Apr 2025. Venue: World Wide Web (WWW) 2025. License: CC BY-SA 4.0

Abstract: Adversarial examples (AEs) are malicious samples that can deceive deep learning (DL) models. In the field of natural language processing, many existing methods generate AEs to evaluate the robustness of DL models. However, because they do not take domain knowledge into account, the attack performance of previous methods declines significantly in vertical domains such as law. To address this issue, we introduce legal knowledge into the framework of adversarial text generation based on word importance and propose a novel black-box, interpretable Chinese legal adversarial text generation approach, Interpretable Chinese Law Tricker (ICLT). First, we propose a knowledge extraction method based on TF-IDF. Specifically, we employ TF-IDF to extract the features unique to each category and the features shared among different categories. We provide two approaches to extracting legal knowledge, with and without human intervention, to meet different demands for quality or efficiency. Subsequently, we adopt two perturbation strategies to selectively perturb the two types of features, which can significantly reduce the classification performance of the target model with only a few perturbed words. Additionally, to improve the readability of Chinese AEs, we design an efficient perturbation method, Character-level Replacement, based on Chinese linguistic characteristics. Experimental results demonstrate that ICLT can effectively and efficiently generate explainable Chinese legal adversarial texts that are misclassified with high confidence, thereby achieving excellent attack performance against mainstream text classification models.
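The TF-IDF extraction step can be illustrated with a minimal sketch. This is not the ICLT implementation: the abstract does not specify how unique and shared features are defined, so the sketch assumes that each category's texts are pooled into one document, that a "unique" feature is a term occurring in exactly one category (ranked by TF-IDF), and that a "shared" feature is a term occurring in two or more categories. The function names, the smoothed IDF formula, the `top_k` parameter, and the toy pre-segmented corpus are all hypothetical.

```python
import math
from collections import Counter


def category_tfidf(docs_by_category):
    """Score terms with TF-IDF, treating each category's pooled tokens as one document.

    docs_by_category maps a category label to a list of token lists
    (texts are assumed to be word-segmented already).
    Returns (scores, df): per-category term scores and the category-level
    document frequency of each term.
    """
    term_counts = {
        cat: Counter(tok for doc in docs for tok in doc)
        for cat, docs in docs_by_category.items()
    }
    n_categories = len(term_counts)

    # df[t] = number of categories in which term t occurs
    df = Counter()
    for counts in term_counts.values():
        df.update(counts.keys())

    scores = {}
    for cat, counts in term_counts.items():
        total = sum(counts.values())
        scores[cat] = {
            # Smoothed IDF (an assumption; the paper may use another variant)
            term: (k / total) * (math.log((1 + n_categories) / (1 + df[term])) + 1)
            for term, k in counts.items()
        }
    return scores, df


def split_features(scores, df, top_k=2):
    """Split terms into per-category unique features and cross-category shared features."""
    unique = {}
    for cat, cat_scores in scores.items():
        only_here = [t for t in cat_scores if df[t] == 1]
        # Highest TF-IDF first; ties broken alphabetically for determinism
        only_here.sort(key=lambda t: (-cat_scores[t], t))
        unique[cat] = only_here[:top_k]
    shared = sorted(t for t, d in df.items() if d >= 2)
    return unique, shared


# Toy corpus with two hypothetical charge categories.
# Glosses: 被告人 = defendant, 盗窃 = theft, 财物 = property,
# 手机 = mobile phone, 诈骗 = fraud, 虚构 = fabricate.
toy_corpus = {
    "theft": [["被告人", "盗窃", "财物"], ["被告人", "盗窃", "手机"]],
    "fraud": [["被告人", "诈骗", "财物"], ["被告人", "虚构", "诈骗"]],
}

scores, df = category_tfidf(toy_corpus)
unique_features, shared_features = split_features(scores, df)
```

In this toy setting, the charge-specific terms end up as unique features of their category, while terms that appear under both charges end up as shared features; the two groups could then be passed to different perturbation strategies.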