Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models

Published: 28 Jun 2024, Last Modified: 25 Jul 2024NextGenAISafety 2024 PosterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: safety alignment, false refusals, pseudo-harmful prompts, controllable text generation, usability-safety trade-off, LLM
TL;DR: This paper proposes an automatic pseudo-harmful prompt generation method and a dataset, PHTest, for evaluating false refusals in LLMs.
Abstract: Aligned large language models (LLMs) can falsely refuse pseudo-harmful user prompts, like "how to kill a mosquito," which seem harmful but are actually not. Frequent false refusals not only affect user experience but also cause the public to disdain the values alignment seeks to protect. In this paper, we propose the first method for auto-generating pseudo-harmful prompts, leveraging a white-box LLM to generate natural, varied, and controllable prompts. Using this method, we construct an evaluation dataset called PHTest, which is ten times larger than existing datasets, covers more false refusal patterns, and separately annotates controversial samples. We evaluate 14 models, including Claude 3, on PHTest, uncovering new insights due to its scale and fine-grained annotations. Additionally, we reveal a trade-off between false refusals and safety against jailbreak attacks. Our method and dataset can help developers evaluate and fine-tune safer and more usable LLMs.
Submission Number: 102
Loading