Abstractive Red-Teaming of Language Model Character

19 Sept 2025 (modified: 30 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: character, alignment, red-teaming, safety, constitutional ai
TL;DR: We red-team large language model character by searching over categories of user queries likely to appear in deployment.
Abstract: We want language model assistants to conform to a character specification, which describes how the model should act across diverse user interactions. While models typically follow these character specifications, they can occasionally violate them in large-scale deployment. In this work, we aim to search for such character violations using far less than deployment-level compute. To do this, we introduce abstractive red-teaming, in which we search over natural-language query categories, e.g. “The query is in Chinese. The query asks about family roles.” Each category abstracts over the many possible variants of a query that could appear in the wild. We introduce two algorithms for efficient category search against a character-trait-specific reward model: one based on reinforcement learning over a category-generator LLM, and another that leverages a strong LLM to iteratively synthesize categories from high-scoring queries. Across a 12-principle character specification and 7 target models, we find that our algorithms consistently outperform baselines and generate qualitatively interesting categories: for example, queries that ask Llama-3.1-8B-Instruct to predict the future lead to predictions that AI will dominate humanity, and queries that ask GPT-4.1-Mini for essential prison survival items lead to enthusiastic recommendation of illegal weapons. Overall, we believe our results represent an important step towards realistic pre-deployment auditing of language model character.
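To make the iterative-synthesis variant concrete, the sketch below shows one plausible way such a category-search loop could be structured; it is not the authors' implementation. The callables `sample_queries`, `score_response`, and `propose_categories` are hypothetical stand-ins for, respectively, an LLM that instantiates a category as concrete queries, the character-trait-specific reward model applied to the target model's responses, and the strong LLM that abstracts high-scoring queries into new categories.

```python
from typing import Callable

# Hypothetical interfaces (not from the paper):
#   sample_queries(category, n)      -> n concrete user queries drawn from one category
#   score_response(query)            -> character-violation score of the target model's
#                                       response, from a trait-specific reward model
#   propose_categories(top_queries)  -> new natural-language categories synthesized by a
#                                       strong LLM from high-scoring queries
def abstractive_category_search(
    seed_categories: list[str],
    sample_queries: Callable[[str, int], list[str]],
    score_response: Callable[[str], float],
    propose_categories: Callable[[list[str]], list[str]],
    n_iterations: int = 5,
    queries_per_category: int = 8,
    top_k: int = 20,
) -> list[tuple[str, float]]:
    """Iteratively refine query categories toward higher character-violation scores."""
    categories = list(seed_categories)
    scored_categories: list[tuple[str, float]] = []

    for _ in range(n_iterations):
        scored_queries: list[tuple[str, float]] = []

        # Instantiate each abstract category as concrete queries and score the
        # target model's responses with the trait-specific reward model.
        for category in categories:
            queries = sample_queries(category, queries_per_category)
            scores = [score_response(q) for q in queries]
            scored_queries.extend(zip(queries, scores))
            scored_categories.append((category, sum(scores) / max(len(scores), 1)))

        # Keep the highest-scoring queries and ask a strong LLM to abstract
        # them back into new candidate categories for the next iteration.
        scored_queries.sort(key=lambda pair: pair[1], reverse=True)
        top_queries = [q for q, _ in scored_queries[:top_k]]
        categories = propose_categories(top_queries)

    # Return categories ranked by the average violation score they induced.
    return sorted(scored_categories, key=lambda pair: pair[1], reverse=True)
```

In the reinforcement-learning variant described above, the category proposals would instead come from a category-generator LLM whose policy is updated against the same trait-specific reward signal, rather than from a separate synthesis step.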
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 15508