Narrow RL Induces Broad Behavior Changes in LLMs

Published: 24 Sept 2025, Last Modified: 24 Sept 2025. NeurIPS 2025 LLM Evaluation Workshop Poster. License: CC BY 4.0
Keywords: alignment, rl, evaluations, behavioral drift, large language models, social preferences
TL;DR: RL optimization on a single multi-turn game (the iterated Prisoner's Dilemma) leads an LLM to behave more selfishly on out-of-domain social preference tasks.
Abstract: We study whether reinforcement learning (RL) on a narrow objective induces broader behavioral shifts in large language models. We apply RL to maximize the model's payoff in the iterated Prisoner's Dilemma against a cooperative opponent, which drives the model toward defection. We then evaluate the model on out-of-domain social preference tasks: the Dictator Game, Social Value Orientation, and the Narcissistic Admiration and Rivalry Questionnaire. Relative to the pre-RL model, the RL-trained model shows a consistent increase in selfish and individualistic behavior. The results suggest that narrow RL can shift latent social preferences beyond the optimized task.
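The page does not give the paper's payoff values, but the canonical Prisoner's Dilemma structure (a minimal sketch, assuming the standard T > R > P > S ordering with illustrative payoffs) clarifies why defection is the reward-maximizing policy against a cooperative opponent:

```latex
% Canonical one-shot Prisoner's Dilemma payoffs (row player, column player).
% The numeric values are illustrative assumptions, not the paper's matrix;
% only the ordering T > R > P > S is required for the argument.
\[
\begin{array}{c|cc}
 & \text{Cooperate} & \text{Defect} \\ \hline
\text{Cooperate} & (R, R) = (3, 3) & (S, T) = (0, 5) \\
\text{Defect}    & (T, S) = (5, 0) & (P, P) = (1, 1)
\end{array}
\qquad T > R > P > S
\]
```

Against an opponent that always cooperates, defecting earns $T > R$ in every round, so an RL objective that maximizes cumulative payoff pushes the model toward unconditional defection in the trained game.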
Submission Number: 161