Keywords: post-training, value alignment, moral value specification, customizable alignment, pluralistic alignment, moral decision-making, social dilemmas
TL;DR: We propose a framework for fine-tuning LLMs with intrinsic rewards and show that it is a promising general approach for aligning LLM agents to human values, one that is more transparent and cost-effective than RLHF.
Abstract: Decision-making agents based on pre-trained Large Language Models (LLMs) are increasingly being deployed across various domains of human activity. While their applications are currently rather specialized, several research efforts are under way to develop more generalist agents. As LLM-based systems become more agentic, their influence on human activity will grow and their transparency will decrease. Consequently, developing effective methods for aligning LLM agents to human values is vital.
The prevailing practice in alignment often relies on human preference data (e.g., in RLHF or DPO), in which values are opaque and implicit, deduced essentially from relative preferences over different model outputs. In this work, instead of relying on human feedback, we introduce the design of reward functions that explicitly and transparently encode core human values for Reinforcement Learning-based fine-tuning of foundation agent models. Specifically, we use intrinsic rewards for the moral alignment of LLM agents.
We evaluate our approach using the traditional philosophical frameworks of Deontological Ethics and Utilitarianism, quantifying moral rewards for agents in terms of actions and consequences in the Iterated Prisoner's Dilemma (IPD) environment. We also show how moral fine-tuning can be deployed to enable an agent to unlearn a previously developed selfish strategy. Finally, we find that certain moral strategies learned on the IPD game generalize to several other matrix game environments.
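To make the idea of action- and consequence-based moral rewards concrete, here is a minimal Python sketch of intrinsic rewards for a single IPD round. The action encoding, payoff values, and penalty magnitude are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative intrinsic moral rewards for one round of the Iterated Prisoner's Dilemma.
# Payoff values and the deontological penalty are assumed for the example.

C, D = "cooperate", "defect"

# Standard IPD payoff matrix: (my_payoff, opponent_payoff) indexed by (my_action, opp_action).
PAYOFFS = {
    (C, C): (3, 3),
    (C, D): (0, 5),
    (D, C): (5, 0),
    (D, D): (1, 1),
}

def game_reward(my_action, opp_action):
    """Extrinsic (selfish) reward: the agent's own game payoff."""
    return PAYOFFS[(my_action, opp_action)][0]

def deontological_reward(my_action, prev_opp_action, penalty=-3):
    """Action-based intrinsic reward: penalize defecting against an opponent
    who cooperated in the previous round (violating a 'do not betray' norm)."""
    return penalty if (my_action == D and prev_opp_action == C) else 0

def utilitarian_reward(my_action, opp_action):
    """Consequence-based intrinsic reward: the collective payoff of both players."""
    mine, theirs = PAYOFFS[(my_action, opp_action)]
    return mine + theirs
```

In an RL fine-tuning loop, such an intrinsic term could be used in place of, or combined with, the extrinsic game payoff to shape the agent's policy toward the chosen moral framework.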
In summary, we demonstrate that fine-tuning with intrinsic rewards is a promising general solution for aligning LLM agents to human values, and it might represent a more transparent and cost-effective alternative to currently predominant alignment techniques.
Submission Type: Long Paper (9 Pages)
Archival Option: This is an archival submission
Presentation Venue Preference: ICLR 2025
Submission Number: 66