Metanormative Theory for RL-Based Moral Agents

Published: 30 Mar 2026, Last Modified: 30 Mar 2026, EMAS 2026 Oral, CC BY 4.0
Keywords: Reinforcement learning, machine ethics, AI alignment, metanormative theory
TL;DR: The paper examines what it means for a reinforcement learning (RL) agent to act morally, using insights from metanormative theory to evaluate which RL approaches could produce ethically behaving artificial agents---and which likely cannot.
Abstract: The overlapping disciplines of machine ethics and AI alignment are concerned with designing artificial agents that are aligned with human values and act in ethically acceptable ways. A recent trend is to use reinforcement learning (RL) in the design of such agents while abstracting away from work in moral philosophy. This paper explores the following question: What does it mean for an RL agent to act morally, or to act in ways that are ethically acceptable? We address this question by pursuing two related goals. The first is to draw out ideas from recent philosophical work in metanormative theory that can guide our thinking about artificial moral agency. The second is to examine the architectures of RL agents through the lens of these ideas. This allows us to identify the RL-based approaches that hold the greatest promise in the context of machine ethics and AI alignment.
Paper Type: Regular paper
Demo: No, we do not plan to present a demo.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 53