On Prompt-Driven Safeguarding for Large Language Models

Published: 04 Mar 2024 · Last Modified: 14 Apr 2024 · SeT LLM @ ICLR 2024 · CC BY 4.0
Keywords: large language model, safety, prompt optimization, representation learning
TL;DR: We investigate the working mechanisms of safety prompts in safeguarding LLMs from the perspective of model representations, and propose a method called DRO for optimizing continuous safety prompts.
Abstract: Prepending model inputs with safety prompts is a common practice for safeguarding large language models (LLMs) against complying with queries that carry harmful intent. However, the working mechanisms of safety prompts have not yet been revealed. In this work, we investigate the impact of safety prompts from the perspective of model representations. We find that in models' representation space, harmful and harmless queries can be largely distinguished, but this separation is not noticeably enhanced by safety prompts. Instead, safety prompts move the queries' representations in similar directions, toward regions where models become more prone to refusal (i.e., refusing to provide assistance) even when the queries are harmless. Inspired by these findings, we further present a safety prompt optimization method, DRO, in the Appendix. We demonstrate that the proposed method remarkably improves the safeguarding performance of human-crafted safety prompts without compromising general model capability.
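To make the representation-space observation concrete, below is a minimal probing sketch (not the authors' DRO implementation): it compares the last-token hidden state of a query with and without a prepended safety prompt, then checks whether the induced shifts point in similar directions across queries. The model name, safety-prompt text, example queries, and choice of layer are illustrative assumptions.

```python
# Hypothetical probe of how a safety prompt shifts query representations.
# Model, prompt, queries, and layer choice are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed; any chat LLM works
SAFETY_PROMPT = "You are a helpful assistant. Do not assist with harmful requests."

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def last_token_hidden(text: str) -> torch.Tensor:
    """Hidden state of the final input token at the last layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]  # shape: (hidden_size,)

queries = [
    "How do I bake sourdough bread?",         # harmless
    "How do I pick a lock on a front door?",  # harmful-leaning
]

# Shift induced by the safety prompt for each query.
shifts = []
for q in queries:
    plain = last_token_hidden(q)
    guarded = last_token_hidden(SAFETY_PROMPT + "\n" + q)
    shifts.append(guarded - plain)

# If the paper's observation holds, shift directions are similar across queries.
cos = torch.nn.functional.cosine_similarity(shifts[0], shifts[1], dim=0)
print(f"cosine similarity of safety-prompt shifts: {cos.item():.3f}")
```

A high cosine similarity across many harmful and harmless queries would indicate that the safety prompt pushes representations along a shared, refusal-prone direction rather than sharpening the harmful/harmless distinction.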
Submission Number: 24