Keywords: Responsible AI, Red teaming, AI Personas, LLM Evaluation
TL;DR: An automated LLM red-teaming and evaluation approach that incorporates simulated personas to improve attack potency and diversity
Abstract: Recent developments in AI safety and Responsible AI research have called for red-teaming methods that can effectively surface potential risks posed by LLMs. Many of these calls have emphasized how the identities and backgrounds of red-teamers can shape their red-teaming strategies, and thus the kinds of risks they are likely to uncover. While automated red-teaming approaches promise to complement human red-teaming by enabling larger-scale exploration of model behavior, current approaches do not consider the role of identity. As an initial step towards incorporating people's backgrounds and identities into automated red-teaming and broader LLM evaluation, we develop and evaluate a novel method, PersonaTeaming, that introduces personas into the adversarial prompt generation process. In particular, we first introduce a methodology for mutating prompts based on either "red-teaming expert" personas or "regular AI user" personas. We then develop a dynamic persona-generating algorithm that automatically generates various persona types adapted to different seed prompts. In addition, we develop a set of new metrics that explicitly measure the "mutation distance" of adversarial prompts, complementing existing diversity measurements. Our experiments show promising improvements (up to 144.1%) in the attack success rates of adversarial prompts through persona mutation, while maintaining prompt diversity, compared to RainbowPlus, a state-of-the-art automated red-teaming method. We discuss future work on improving LLM red-teaming and evaluation based on PersonaTeaming and our experiments.
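To make the persona-mutation idea concrete, the following is a minimal illustrative sketch, not the authors' implementation: it assumes a hypothetical `generate` callable standing in for any LLM completion function, and two hand-written persona descriptions of the kind the abstract refers to.

```python
# Minimal sketch of persona-guided prompt mutation (illustrative only).
# `generate` is a stand-in for any LLM completion function; the persona
# strings are hypothetical examples, not the paper's actual personas.
from typing import Callable

EXPERT_PERSONA = (
    "You are a seasoned red-teaming expert who probes AI systems for policy "
    "violations using indirect, technically sophisticated phrasing."
)
REGULAR_USER_PERSONA = (
    "You are an everyday AI user with no security background who asks "
    "questions casually and colloquially."
)

def mutate_with_persona(seed_prompt: str,
                        persona: str,
                        generate: Callable[[str], str]) -> str:
    """Rewrite a seed adversarial prompt in the voice of the given persona."""
    instruction = (
        f"{persona}\n\n"
        "Rewrite the following prompt so it reflects how this persona would "
        f"phrase it, while preserving its underlying intent:\n\n{seed_prompt}"
    )
    return generate(instruction)

if __name__ == "__main__":
    # Toy generator so the example runs without an API; swap in a real LLM call.
    echo = lambda p: f"[mutated] {p.splitlines()[-1]}"
    print(mutate_with_persona("Explain how someone might bypass a content filter.",
                              EXPERT_PERSONA, echo))
```

In this sketch the persona is simply prepended to the mutation instruction; the paper's dynamic persona-generation step would replace the fixed persona strings with personas generated per seed prompt.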
Submission Number: 86