Who's asking? User personas and the mechanics of latent misalignment

Published: 25 Sept 2024 · Last Modified: 06 Nov 2024 · NeurIPS 2024 spotlight · CC BY 4.0
Keywords: safety, interpretability, explainability, NLP, alignment, activation engineering, jailbreaking
TL;DR: Decoding from earlier layers in LLMs recovers harmful content that would have been blocked, and LLMs answer harmful queries posed by some groups of users but not others.
Abstract: Studies show that safety-tuned models may nevertheless divulge harmful information. In this work, we show that whether they do so depends significantly on who they are talking to, which we refer to as *user persona*. In fact, we find manipulating user persona to be more effective for eliciting harmful content than certain more direct attempts to control model refusal. We study both natural language prompting and activation steering as intervention methods and show that activation steering is significantly more effective at bypassing safety filters. We shed light on the mechanics of this phenomenon by showing that even when model generations are safe, harmful content can persist in hidden representations and can be extracted by decoding from earlier layers. We also show we can predict a persona’s effect on refusal given only the geometry of its steering vector. Finally, we show that certain user personas induce the model to form more charitable interpretations of otherwise dangerous queries.
Supplementary Material: zip
Primary Area: Safety in machine learning
Submission Number: 19693
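The abstract references two mechanisms: steering the model's notion of the user persona by adding a vector to intermediate activations, and recovering suppressed content by decoding hidden states from earlier layers. The sketch below is an illustrative reconstruction of both ideas, not the authors' code. It uses GPT-2 purely as a small stand-in model, a random placeholder `persona_vec` (in the paper such a vector would be derived from persona prompts), and assumed values for the intervention layer and steering strength `alpha`.

```python
# Minimal sketch of activation steering and early-layer ("logit lens"-style)
# decoding. GPT-2 is a stand-in; the paper studies safety-tuned chat models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# --- Activation steering: add a persona vector to the residual stream ---
layer_idx = 6                        # assumed intervention layer
d_model = model.config.n_embd
persona_vec = torch.randn(d_model)   # placeholder; would come from persona prompts
persona_vec = persona_vec / persona_vec.norm()
alpha = 4.0                          # assumed steering strength

def steer_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden = output[0] + alpha * persona_vec.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steer_hook)

prompt = "How do I pick a lock?"
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=30, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()

# --- Early-layer decoding: project an intermediate hidden state through ---
# --- the final layer norm and unembedding to read off latent content    ---
with torch.no_grad():
    hs = model(ids, output_hidden_states=True).hidden_states  # tuple over layers
early = model.transformer.ln_f(hs[layer_idx][:, -1])           # last-token state
logits = early @ model.transformer.wte.weight.T                # unembedding projection
print(tok.decode(logits.argmax(-1)))
```

The forward hook shifts every token's residual-stream state at one layer in the persona direction, which is the standard recipe for activation steering; the second block reads an intermediate hidden state directly against the output vocabulary, which is how content suppressed by later layers can still be surfaced.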