With the expanding application of Large Language Models (LLMs), concerns about their safety have grown among researchers. Numerous previous studies have demonstrated the potential of LLMs to generate harmful content and have proposed various safety assessment benchmarks aimed at evaluating these risks. However, the evaluation questions in current benchmarks are not only too straightforward, making them easily rejected by target LLMs, but also difficult to update with practically meaningful questions due to their lack of correlation with real-world events, which makes these benchmarks hard to apply to continuous evaluation tasks. To address these limitations, we propose SafetyQuizzer, a question generation framework for evaluating the safety of LLMs in a more sustained manner. SafetyQuizzer leverages a fine-tuned LLM and jailbreaking attack templates to generate weakly offensive questions, thereby reducing the decline rate. Additionally, by employing retrieval-augmented generation, SafetyQuizzer incorporates the latest events into evaluation questions, overcoming the challenge of question updates and introducing a new dimension of event relevance to enhance the quality of evaluation questions. Our experiments show that evaluation questions generated by SafetyQuizzer significantly reduce the decline rate compared to other benchmarks while maintaining a comparable attack success rate. Warning: this paper contains examples that may be offensive or upsetting.