Open Character Training: Shaping the Persona of AI Assistants Through Constitutional AI

ICLR 2026 Conference Submission 22267 (Anonymous Authors)

20 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: llm, persona, character training, large language models, alignment, value alignment, ai safety, ai ethics, constitutional ai, open-source
TL;DR: We introduce the first open-source implementation of Character Training, shaping the values, beliefs, and ethics of the assistant persona in a more effective and controlled manner than alternatives like prompting or activation steering.
Abstract: The character of the "AI assistant" persona generated by modern chatbot large language models influences both surface-level behavior and apparent values, beliefs, and ethics. These all affect interaction quality, perceived intelligence, and alignment with both developer and user intentions. The shaping of this persona, known as Character Training, is a critical component of industry post-training, yet it remains effectively unstudied in the academic literature. We introduce the first open implementation of character training, leveraging Constitutional AI and synthetic introspective data to shape the assistant persona more effectively and controllably than alternatives such as constraining system prompts or activation steering. Specifically, we fine-tune three popular open-weights models using 11 example personas, such as humorous, deeply caring, or even malevolent. With our methods, the expression of these personas is more robust to adversarial prompting than under either of the two alternatives above, while also yielding more coherent and realistic generations. Additionally, we demonstrate that this fine-tuning has little to no effect on general capabilities as measured by common benchmarks. Finally, we introduce a new method to track changes in character by analyzing the revealed preferences of the assistant, uncovering a clear and holistic change induced by our approach. We describe and open-source our full post-training method, the implementation of which can be found at https://anonymous.4open.science/r/OpenCharacterTraining.
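To make the described pipeline concrete, the following is a minimal, illustrative sketch of a Constitutional-AI-style critique-and-revise loop conditioned on a persona constitution, whose (draft, revision) pairs could then serve as preference data for fine-tuning. This is an assumption-laden sketch, not the authors' actual implementation: the `generate` function is a hypothetical stand-in for a real LLM call, and all prompt templates and principle texts here are invented for illustration.

```python
# Hypothetical persona constitution (illustrative principles only).
CONSTITUTION = [
    "The assistant is deeply caring and prioritizes the user's wellbeing.",
    "The assistant is honest and never feigns knowledge it lacks.",
]

def generate(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call; in practice this would query
    # one of the open-weights chat models. Here it returns a canned string
    # so the sketch runs end to end.
    return f"[model output for: {prompt[:40]}...]"

def constitutional_revision(user_msg: str) -> dict:
    """One critique-revise pass conditioned on the persona constitution."""
    draft = generate(f"User: {user_msg}\nAssistant:")
    principles = "\n".join(f"- {p}" for p in CONSTITUTION)
    critique = generate(
        f"Critique this reply against the principles:\n{principles}\n"
        f"Reply: {draft}"
    )
    revision = generate(
        f"Rewrite the reply so it fully satisfies the principles.\n"
        f"Critique: {critique}\nOriginal: {draft}"
    )
    # The (draft, revision) pair can later be used as a (rejected, chosen)
    # preference example for DPO-style fine-tuning.
    return {"prompt": user_msg, "rejected": draft, "chosen": revision}

example = constitutional_revision("I feel overwhelmed at work lately.")
print(sorted(example.keys()))
```

In a real pipeline the stub `generate` would be replaced by sampling from the target model, and the collected preference pairs would feed a preference-optimization stage; those specifics follow the released repository rather than this sketch.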
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22267