Narrow RL Induces Broad Behavior Changes in LLMs

Published: 24 Sept 2025, Last Modified: 24 Sept 2025. NeurIPS 2025 LLM Evaluation Workshop Poster. License: CC BY 4.0
Keywords: alignment, rl, evaluations, behavioral drift, large language models, social preferences
TL;DR: RL optimization on a single multi-turn game (the iterated Prisoner's Dilemma) leads an LLM to behave more selfishly on out-of-domain social preference tasks.
Abstract: We study whether reinforcement learning (RL) on a narrow objective induces broader behavioral shifts in large language models. We apply RL to maximize the model's payoff in the iterated Prisoner's Dilemma against a cooperative opponent, which drives the model toward defection. We then evaluate the model on out-of-domain social preference tasks: the Dictator Game, Social Value Orientation, and the Narcissistic Admiration and Rivalry Questionnaire. Relative to the pre-RL model, the RL-trained model shows a consistent increase in selfish and individualistic behavior. The results suggest that narrow RL can shift latent social preferences beyond the optimized task.
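The page does not give the paper's payoff values, but the canonical Prisoner's Dilemma structure (a minimal sketch, assuming the standard T > R > P > S ordering with illustrative payoffs) clarifies why defection is the reward-maximizing policy against a cooperative opponent:

```latex
% Canonical one-shot Prisoner's Dilemma payoffs (row player, column player).
% The numeric values are illustrative assumptions, not the paper's matrix;
% only the ordering T > R > P > S is required for the argument.
\[
\begin{array}{c|cc}
 & \text{Cooperate} & \text{Defect} \\ \hline
\text{Cooperate} & (R, R) = (3, 3) & (S, T) = (0, 5) \\
\text{Defect}    & (T, S) = (5, 0) & (P, P) = (1, 1)
\end{array}
\qquad T > R > P > S
\]
```

Against an opponent that always cooperates, defecting earns $T > R$ in every round, so an RL objective that maximizes cumulative payoff pushes the model toward unconditional defection in the trained game.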
Submission Number: 161