Keywords: Reinforcement Learning, Moral Alignment, Morality, Cooperative AI, LLM fine-tuning, Multi-agent social dilemmas, LLM, Generalisation, AI Alignment
Abstract: Can a model learn to be moral by playing games? While existing alignment methods rely predominantly on learned preference signals and opaque moral values, we investigate whether fine-tuning with explicitly defined moral rewards can induce transferable cooperative dispositions in LLM agents. Generalization is evaluated across three dimensions: strategic complexity, model capability, and naturalistic complexity. We show that an LLM fine-tuned exclusively on numerical multi-agent games, with no natural-language moral content, reduces harmful actions by up to 35% in semantically unrelated interactive environments. However, this generalization occurs only when the model is trained on iterated public goods games rather than pairwise reciprocity games, and only when environment complexity is matched to model capability. Our results provide evidence that intrinsic moral fine-tuning is a promising direction for LLM alignment, and offer preliminary answers to three questions: which environments work, for which models, and why.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 148