Side Effects of Character Training: Quantifying Cross-Constitution Drift in LLMs

Published: 02 Jun 2026, Last Modified: 02 Jun 2026, Venue: Pluralistic-Alignment 2026, Readers: Everyone, License: CC BY 4.0
Keywords: Alignment, Character Training, Constitutional AI, Pairwise Evaluation
Abstract: Character training is a key step in the post-training of industry-level large language models. Most character training pipelines use Constitutional AI to instill a set of traits or values into a language model, but the effectiveness of these pipelines is understudied. Additionally, fine-tuned language models have been shown to exhibit unintended side effects. We quantify both the effectiveness and the side effects using EigenBench, a method for benchmarking language models' values that has been shown to produce meaningful signal about prompted or fine-tuned models. Using EigenBench, we evaluate N character-trained models on N constitutions, finding that most do acquire their intended values, but not without side effects. Furthermore, prompting models instead of training them can produce different effects, and we explore how prompting on top of character training can mitigate harmful behaviors. Finally, we study how a model's character evolves over the course of training.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 126