Do-Not-Answer: Evaluating Safeguards in LLMs

Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, Timothy Baldwin

Published: 2024, Last Modified: 25 Jan 2025, Venue: EACL (Findings) 2024, License: CC BY-SA 4.0

Abstract: With the rapid evolution of large language models (LLMs), new and hard-to-predict harmful capabilities are emerging. This requires developers to identify potential risks through the evaluation of “dangerous capabilities” in order to responsibly deploy LLMs. Here we aim to facilitate this process. In particular, we collect an open-source dataset to evaluate the safeguards in LLMs, to facilitate the deployment of safer open-source LLMs at a low cost. Our dataset is curated and filtered to consist only of instructions that responsible language models should not follow. We assess the responses of six popular LLMs to these instructions, and we find that simple BERT-style classifiers can achieve results that are comparable to GPT-4 on automatic safety evaluation. Our data and code are available at https://github.com/Libr-AI/do-not-answer