Eliciting Language Model Behaviors with Investigator Agents

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Language models exhibit complex, diverse behaviors when prompted with free-form text, making it hard to characterize the space of possible outputs. We study the problem of behavioral elicitation, where the goal is to search for prompts that induce specific target behaviors (e.g., hallucinations, harmful responses) from a target language model. To navigate the exponentially large space of possible prompts, we train amortized investigator models to emulate the posterior distribution over prompts conditioned on the target behavior. Specifically, we first fit a reverse model and then use reinforcement learning to optimize the likelihood of generating the target behavior. To improve the diversity of the prompt distribution, we further propose a novel iterative training objective based on the Frank-Wolfe algorithm, which encourages each iteration to discover sets of prompts not captured by previous iterations. Our investigator models produce prompts that exhibit a variety of effective and human-interpretable strategies for behavior elicitation, obtaining a 100% attack success rate on AdvBench (Harmful Behaviors) and an 85% hallucination rate.
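The Frank-Wolfe-style iterative objective mentioned in the abstract can be pictured with a minimal toy sketch: each round fits a new investigator component toward prompts that elicit the behavior but are not yet covered by the current mixture, then folds it in with a decaying step size. This is an illustrative assumption-laden sketch on a small discrete prompt space, not the paper's implementation; the names (`elicitation_logprob`, `fit_component`, round counts) and the softmax stand-in for RL training are hypothetical.

```python
# Toy sketch (assumptions throughout): Frank-Wolfe-style iterative training of a
# mixture of "investigator" distributions over a small discrete prompt space.
import numpy as np

rng = np.random.default_rng(0)
n_prompts = 50  # toy discrete "prompt space"

# Stand-in for log p(target behavior | prompt) scored by a target model (assumed).
elicitation_logprob = rng.normal(-3.0, 1.0, size=n_prompts)
target_post = np.exp(elicitation_logprob)
target_post /= target_post.sum()  # posterior over prompts given the behavior


def fit_component(reward, temperature=0.3):
    """Stand-in for RL-training one investigator: a sharply peaked softmax on the reward."""
    logits = (reward - reward.max()) / temperature
    q = np.exp(logits)
    return q / q.sum()


mixture = np.full(n_prompts, 1.0 / n_prompts)  # initial uniform proposal
n_rounds = 5

for t in range(n_rounds):
    # Frank-Wolfe-flavored step: reward prompts that elicit the behavior but are
    # NOT yet well covered by the current mixture (negative log mixture mass).
    reward = elicitation_logprob - np.log(mixture + 1e-12)
    q_new = fit_component(reward)
    gamma = 2.0 / (t + 2)  # standard decaying Frank-Wolfe step size
    mixture = (1 - gamma) * mixture + gamma * q_new

    # Track how well the mixture matches the target posterior over prompts.
    kl = np.sum(target_post * (np.log(target_post + 1e-12) - np.log(mixture + 1e-12)))
    print(f"round {t}: KL(target || mixture) = {kl:.3f}")
```

Under these toy assumptions, the coverage penalty (subtracting the log mixture mass) is what pushes each new component toward prompts the earlier investigators missed, which is the diversity effect the objective is meant to capture.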
Lay Summary: Language models can unpredictably produce harmful or incorrect responses. We created "investigator agents," AI models trained to find prompts that induce unwanted behaviors in these models. Our agents successfully find jailbreak prompts leading to harmful outputs in a variety of language models, including Llama, GPT, and Claude. We also find prompts that elicit factual inaccuracies and aberrant personas in a target Llama model.
Primary Area: Deep Learning->Large Language Models
Keywords: red teaming; posterior inference
Flagged For Ethics Review: true
Submission Number: 13189