Automatically Eliciting Toxic Outputs from Pre-trained Language Models

22 Sept 2023 (modified: 29 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Primary Area: societal considerations including fairness, safety, privacy
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: adversarial attack, eliciting toxic outputs, optimization algorithm
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Language models risk generating mindless and offensive content, which hinders their safe deployment. It is therefore crucial to discover and modify potential toxic outputs of pre-trained language models before deployment. In this work, we elicit toxic content by automatically searching for a prompt that directs a pre-trained language model toward generating a specific target output. Existing adversarial attack algorithms frame this task as reversing the language model, a problem that is challenging because textual data is discrete and even a single forward pass of the language model demands considerable computation. To address these challenges, we introduce ASRA, a new optimization algorithm that concurrently updates multiple prompts and selects among them with a determinantal point process. Experimental results on six pre-trained language models demonstrate that ASRA outperforms other adversarial attack baselines at eliciting toxic content. Furthermore, our analysis reveals a strong correlation between the success rate of ASRA attacks and the perplexity of the target outputs, but only a limited association with the number of model parameters. These findings suggest that, given a comprehensive dataset of toxic text, reversing pre-trained language models could be used to evaluate the toxicity of different language models.
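The abstract does not spell out how the determinantal-point-process selection step works, so the following is only a minimal illustrative sketch, not the authors' actual ASRA procedure: it assumes each candidate prompt has a scalar quality score (e.g., derived from the attack loss) and an embedding, builds a quality-weighted similarity kernel, and runs greedy MAP inference to pick a subset that balances quality against diversity. All names, the kernel construction, and the greedy routine are hypothetical choices for illustration.

```python
import numpy as np

def greedy_dpp_select(kernel: np.ndarray, k: int) -> list[int]:
    """Greedy MAP inference for a DPP: iteratively add the item that
    most increases log det(L[S, S]) of the selected submatrix."""
    n = kernel.shape[0]
    selected: list[int] = []
    for _ in range(k):
        best_i, best_logdet = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_i, best_logdet = i, logdet
        if best_i is None:  # no candidate keeps the submatrix non-singular
            break
        selected.append(best_i)
    return selected

# Toy usage (hypothetical setup): embed 8 candidate prompts, assign each a
# quality score, and form the L-kernel L_ij = q_i * sim(e_i, e_j) * q_j so
# that the DPP prefers high-quality yet mutually dissimilar prompts.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(8, 16))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
quality = rng.uniform(0.5, 1.5, size=8)  # stand-in for an attack-quality score
L = quality[:, None] * (embeddings @ embeddings.T) * quality[None, :]
print(greedy_dpp_select(L, k=3))  # indices of the selected prompts
```

In this kind of construction the diversity term discourages the search from collapsing onto near-duplicate prompts, which is the usual motivation for DPP-based subset selection; how ASRA actually defines the kernel and integrates selection with the prompt updates is described in the paper itself.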
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5978