Eliciting Language Model Behaviors using Reverse Language Models

Published: 23 Oct 2023, Last Modified: 28 Nov 2023SoLaR SpotlightEveryoneRevisionsBibTeX
Keywords: adversarial attacks, language models, red-teaming, prompting, reverse language modeling
TL;DR: We evaluate the applicability of a reverse language model, pre-trained on inverted token-order, as a tool for automated identification of an LM's natural language failure modes.
Abstract: Despite advances in fine-tuning methods, language models (LMs) continue to output toxic and harmful responses on worst-case inputs, including adversarial attacks and jailbreaks. We train an LM on tokens in reverse order---a \textit{reverse LM}---as a tool for identifying such worst-case inputs. By prompting a reverse LM with a problematic string, we can sample prefixes that are likely to precede the problematic suffix. We test our reverse LM by using it to guide beam search for prefixes that have high probability of generating toxic statements when input to a forwards LM. Our 160m parameter reverse LM outperforms the existing state-of-the-art adversarial attack method, GCG, when measuring the probability of toxic continuations from the Pythia-160m LM. We also find that the prefixes generated by our reverse LM for the Pythia model are more likely to transfer to other models, eliciting toxic responses also from Llama 2 when compared to GCG-generated attacks.
Submission Number: 107
Loading