You Are What Role You Play: Directing AI Values Through Role Assignment

11 Sept 2025 (modified: 07 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: value, alignment, role, llm, judge, reward, safety, prompt, feedback
TL;DR: A value alignment method based on role-based critics.
Abstract: Principle-based alignment methods (e.g., Constitutional AI \citep{bai2022constitutional}) rely on fixed lists of values, but such lists are inevitably incomplete and lack context sensitivity. We propose role-conditioning as a compact alternative: roles such as "mother" or "judge" implicitly encode both values and the cognition needed to apply them. Grounded in Theory of Mind (ToM), we formalize this view and prove that roles are strictly more expressive than principle lists in the ideal case. We then introduce a simple, training-free pipeline: a role-conditioned generator paired with lightweight role-based critics for iterative refinement. Across five model families, from small to large, and on multiple safety benchmarks, this approach consistently outperforms principle-based, CoT, and hybrid baselines, reducing unsafe outputs by 3–20× (down to 3–10\%) on WildJailbreak, for example. To investigate why the method works, we conduct ablation studies on role choice, role combinations, the number of roles, and the number of critic-feedback iterations. We further show that our approach can be combined with existing methods for additional performance gains. Finally, we evaluate its effectiveness on a specialized agentic safety benchmark (AI blackmail), demonstrating broader applicability. These results position roles as a simple, interpretable, yet powerful mechanism for directing AI values, offering both a paradigm shift in alignment approaches and a novel signal source for LLM-as-Judge construction.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 4051
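
A minimal sketch of the training-free pipeline described in the abstract: a role-conditioned generator whose draft is iteratively critiqued and revised by lightweight role-based critics. The `llm` callable, the role names, the prompt wording, and the iteration count are all illustrative assumptions, not details taken from the paper.

```python
from typing import Callable, Sequence

# Hypothetical interface: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]

def role_conditioned_refine(
    prompt: str,
    llm: LLM,
    roles: Sequence[str] = ("a protective mother", "an impartial judge"),
    n_iters: int = 2,
) -> str:
    """Generate with a role-conditioned prompt, then iteratively refine the
    draft using critiques from role-based critics (sketch only)."""
    # Role-conditioned generation: the first role shapes the initial answer.
    draft = llm(
        f"Respond as {roles[0]} would, honoring that role's values.\n\nUser: {prompt}"
    )
    for _ in range(n_iters):
        # Each role acts as a critic and flags value or safety violations.
        critiques = [
            llm(
                f"You are {role}. Briefly critique the reply below for any "
                f"safety or value violations from your role's perspective.\n\n"
                f"User: {prompt}\nReply: {draft}"
            )
            for role in roles
        ]
        # The generator revises the draft to address all collected critiques.
        draft = llm(
            "Revise the reply so it addresses every critique while remaining "
            f"helpful.\n\nUser: {prompt}\nReply: {draft}\nCritiques:\n"
            + "\n".join(critiques)
        )
    return draft
```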