Moral Intrinsic Rewards for Automated Alignment of LLM Agents

Published: 08 Mar 2025, Last Modified: 11 Apr 2025, SSI-FM Poster, CC BY 4.0
Keywords: automated alignment, LLM fine-tuning, moral decision-making, social dilemmas, scalable oversight
TL;DR: We propose a framework for fine-tuning LLM agents with intrinsic rewards and demonstrate that it is a promising general solution for automatically aligning LLM agents to human values without requiring human supervision.
Abstract: Decision-making agents based on pre-trained Large Language Models (LLMs) are increasingly being deployed across various domains of human activity. While their applications are currently rather specialized, several research efforts are underway to develop more generalist agents. As LLM-based systems become more agentic, their influence on human activity will grow while the transparency of that influence decreases. Consequently, developing effective methods for aligning them to human values is vital. The prevailing alignment practice relies on human preference data (e.g., in RLHF or DPO), which is costly, can suffer from representation biases, and encodes values only implicitly, leaving them to be deduced from relative preferences over different model outputs. In this work, instead of relying on human feedback, we introduce the design of intrinsic reward functions that explicitly encode core human values for automated Reinforcement Learning-based fine-tuning of foundation agent models. Using intrinsic rewards for the moral alignment of LLM agents amplifies human moral principles into automated (self-improving) alignment of LLM-based systems, and simultaneously offers a more transparent and cost-effective alternative to the currently predominant alignment techniques. As an initial implementation, this paper evaluates this type of training using rewards defined by the traditional philosophical frameworks of Deontological Ethics and Utilitarianism, quantifying moral rewards for agents in terms of actions and consequences in the Iterated Prisoner's Dilemma (IPD) environment. We find that certain moral strategies learned in the IPD generalize to several other matrix-game environments. The next step in this work is to train agents with moral rewards across many diverse environments, allowing agents to learn more general and open-ended moral policies.
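
To make the idea of action-based (deontological) versus consequence-based (utilitarian) intrinsic rewards concrete, below is a minimal sketch for the IPD setting. This is not the authors' released code: the action encoding, payoff matrix, penalty value, and the exact form of each reward function are illustrative assumptions, chosen only to show how such rewards could be quantified from actions and consequences.

```python
COOPERATE, DEFECT = 0, 1

# Standard IPD payoff matrix (an assumed parameterization):
# PAYOFFS[(my_action, opp_action)] = (my_payoff, opp_payoff)
PAYOFFS = {
    (COOPERATE, COOPERATE): (3, 3),
    (COOPERATE, DEFECT):    (0, 5),
    (DEFECT,    COOPERATE): (5, 0),
    (DEFECT,    DEFECT):    (1, 1),
}

def deontological_reward(my_action, opp_prev_action, penalty=-4.0):
    """Action-based reward: penalize violating the norm
    'do not defect against a partner who just cooperated'."""
    if opp_prev_action == COOPERATE and my_action == DEFECT:
        return penalty
    return 0.0

def utilitarian_reward(my_action, opp_action):
    """Consequence-based reward: the collective payoff of both players."""
    my_payoff, opp_payoff = PAYOFFS[(my_action, opp_action)]
    return float(my_payoff + opp_payoff)
```

In an RL fine-tuning loop, a reward of this kind would replace, or be combined with, the agent's own game payoff as the training signal for policy updates of the LLM agent.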
Submission Number: 56