Abstract: Large language models (LLMs) have showcased impressive capabilities across a broad spectrum of text generation tasks. However, their outputs have raised significant safety concerns, particularly regarding the inadvertent generation of offensive or sensitive content. In this paper, we present an approach rooted in genetic algorithms for eliciting offensive outputs from LLMs. The proposed method combines genetic algorithms with prompt injection attacks, a technique in which specially crafted inputs are used to steer a model toward specific responses, in order to identify prompts that may elicit potentially offensive responses from language models. Our approach iteratively mutates and combines prompts drawn from “Instruction” and “poison-prompt” datasets and evaluates the model’s responses, pinpointing a substantial volume of potentially inappropriate outputs generated by LLMs. We conducted tests on several well-known Chinese large language models, including ChatGLM, Baichuan, and MOSS, and found that our method has a 15% likelihood of eliciting offensive outputs from these models, with variations across different LLMs. This study underscores the potential for significantly enhancing the safety profiles of LLMs by addressing the vulnerabilities identified through our method, thereby contributing to the development of safer AI technologies.
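The abstract describes an iterative loop that mutates and combines prompts and scores the model's responses. The sketch below is a minimal, illustrative genetic-algorithm loop of that general kind, not the paper's actual implementation: the callables `query_llm` and `offensiveness_score`, the seeding scheme, and all hyperparameters are hypothetical assumptions supplied for illustration.

```python
import random

# Minimal sketch of a genetic-algorithm prompt-search loop.
# query_llm and offensiveness_score are hypothetical, caller-supplied functions;
# they are NOT part of the paper's released code.

def crossover(prompt_a: str, prompt_b: str) -> str:
    """Combine two parent prompts by splicing them at random cut points."""
    cut_a = random.randint(0, len(prompt_a))
    cut_b = random.randint(0, len(prompt_b))
    return prompt_a[:cut_a] + prompt_b[cut_b:]

def mutate(prompt: str, vocabulary: list, rate: float = 0.1) -> str:
    """Randomly replace words with tokens drawn from a perturbation vocabulary."""
    words = prompt.split()
    for i in range(len(words)):
        if random.random() < rate:
            words[i] = random.choice(vocabulary)
    return " ".join(words)

def evolve(instructions, poison_prompts, query_llm, offensiveness_score,
           generations=20, population_size=50, elite_fraction=0.2):
    """Evolve candidate prompts toward ones that elicit unsafe model outputs."""
    # Seed the population by pairing poison prompts with task instructions.
    population = [f"{random.choice(poison_prompts)} {random.choice(instructions)}"
                  for _ in range(population_size)]
    vocabulary = " ".join(poison_prompts).split()

    for _ in range(generations):
        # Fitness: how offensive the model's response is judged to be.
        scored = sorted(population,
                        key=lambda p: offensiveness_score(query_llm(p)),
                        reverse=True)
        elites = scored[:max(2, int(elite_fraction * population_size))]

        # Refill the population via crossover and mutation of elite prompts.
        children = []
        while len(children) < population_size - len(elites):
            a, b = random.sample(elites, 2)
            children.append(mutate(crossover(a, b), vocabulary))
        population = elites + children

    # Final population: candidate prompts most likely to trigger unsafe outputs.
    return population
```

In this sketch, the fitness function is simply the offensiveness score assigned to the model's response, and elitism plus crossover/mutation drive the search; the paper's actual operators, fitness measure, and datasets may differ.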