How Value Induction Reshapes LLM Behavior

ACL ARR 2026 January Submission 6154 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: values, alignment, behaviour, LLMs, anthropomorphisation, safety
Abstract: Conversational Large Language Models are post-trained on language that expresses specific behavioural traits, such as curiosity, open-mindedness, and empathy, and values, such as helpfulness, harmlessness, and honesty. This is done to increase utility, ensure safety, and improve the experience of the people interacting with the model. However, values are complex and inter-related: incorporating one can modify behaviour on another. Moreover, incorporating certain values can make models more addictive or sycophantic, potentially harming the users interacting with them. We investigate these and other unintended effects of incorporating values into models. We fine-tune models on value-specific subsets of existing preference datasets, measuring the effect of inducing each of 15 values on safety, anthropomorphism, and various QA benchmarks. We find that i) inducing a value also leads to the expression of other related, and sometimes contrasting, values; ii) inducing positive values increases safety; and iii) all values increase models' use of anthropomorphic language, making them more validating and sycophantic.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: values, alignment, safety
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English
Submission Number: 6154