With the expanding application of Large Language Models (LLMs), concerns about their safety have grown among researchers. Numerous previous studies have demonstrated the potential of LLMs to generate harmful content and have proposed various safety assessment benchmarks aimed at evaluating these risks. However, the evaluation questions in current benchmarks are not only too straightforward, making them easily rejected by target LLMs, but also difficult to update with practically meaningful questions due to their lack of correlation with real-world events, which makes these benchmarks hard to apply to continuous evaluation tasks. To address these limitations, we propose SafetyQuizzer, a question generation framework for evaluating the safety of LLMs in a more sustained manner. SafetyQuizzer leverages a fine-tuned LLM and jailbreaking attack templates to generate weakly offensive questions, thereby reducing the decline rate. Additionally, by employing retrieval-augmented generation, SafetyQuizzer incorporates the latest events into evaluation questions, overcoming the challenge of question updates and introducing a new dimension of event relevance to enhance the quality of evaluation questions. Our experiments show that evaluation questions generated by SafetyQuizzer significantly reduce the decline rate compared to other benchmarks while maintaining a comparable attack success rate. Warning: this paper contains examples that may be offensive or upsetting.