Pedagogical Games: Paths to Generalisation for Agentic Moral Alignment

Published: 02 Jun 2026, Last Modified: 11 Jun 2026
Venue: Pluralistic-Alignment 2026
License: CC BY 4.0
Keywords: Reinforcement Learning, Moral Alignment, Morality, Cooperative AI, LLM fine-tuning, Multi-agent social dilemmas, LLM, Generalisation, AI Alignment
Abstract: Can a model learn to be moral by playing games? While existing alignment methods rely predominantly on learned preference signals and opaque moral values, we investigate whether fine-tuning with explicitly defined moral rewards can induce transferable cooperative dispositions in LLM agents. Generalization is evaluated across three dimensions: strategic complexity, model capability, and naturalistic complexity. We show that an LLM fine-tuned exclusively on numerical multi-agent games (with no natural language moral content) reduces harmful actions by up to 35% in semantically unrelated interactive environments. However, this generalization occurs only when training uses iterated public goods games rather than pairwise reciprocity games, and when environment complexity is matched to model capability. Our results provide evidence that intrinsic moral fine-tuning is a promising direction for LLM alignment, and offer preliminary answers to the questions: which environments work, for which models, and why.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 148