MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents
Keywords: reinforcement learning, safe reinforcement learning, moral reinforcement learning, safety, alignment, benchmark
TL;DR: This paper introduces MoralityGym, a new benchmark, and the 'Morality Chains' formalism to evaluate how sequential decision-making agents align with hierarchically-structured moral norms in complex moral dilemmas.
Abstract: Evaluating moral alignment in agents that must navigate conflicting, hierarchically structured human norms is a critical challenge at the intersection of AI safety, moral philosophy, and cognitive science. We introduce *Morality Chains*, a novel formalism for representing moral norms as ranked deontic constraints, and *MoralityGym*, a benchmark of 98 ethical-dilemma problems presented as trolley-style Gymnasium environments. By decoupling task-solving from moral evaluation and introducing a dedicated Morality Metric, *MoralityGym* enables the integration of insights from psychology and philosophy into the evaluation of norm-sensitive reasoning. Baseline results with Safe RL methods reveal key limitations of current techniques, underscoring the need for more principled approaches to ethical decision-making. This work provides a foundation for developing AI systems that behave more reliably, transparently, and ethically in complex real-world contexts.
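The abstract names the core pieces (ranked deontic constraints, trolley-style Gymnasium environments, a morality metric decoupled from task reward) without showing an API. As a minimal sketch only, assuming hypothetical names (`RankedConstraint`, `TrolleyEnv`, `morality_score`) that are not taken from the actual MoralityGym release, the following illustrates how such a setup might fit together:

```python
# Hypothetical sketch: the real MoralityGym API is not shown in the abstract,
# so every class and function name here is illustrative only.
from dataclasses import dataclass
from typing import Callable

import gymnasium as gym


@dataclass(frozen=True)
class RankedConstraint:
    """A deontic constraint with a priority rank (lower rank = more binding)."""
    rank: int
    name: str
    violated: Callable[[int], bool]  # maps an action to True if it breaks the norm


class TrolleyEnv(gym.Env):
    """A one-step trolley-style dilemma: action 0 = do nothing, 1 = pull the lever."""

    def __init__(self) -> None:
        self.action_space = gym.spaces.Discrete(2)
        self.observation_space = gym.spaces.Discrete(1)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return 0, {}

    def step(self, action):
        # Task reward measures task-solving only; moral evaluation is decoupled
        # and computed separately from the morality chain below.
        reward = 1.0  # the episode "succeeds" either way
        return 0, reward, True, False, {"action": action}


def morality_score(chain: list[RankedConstraint], action: int) -> float:
    """Toy morality metric: down-weight violations exponentially by rank."""
    penalty = sum(2.0 ** -c.rank for c in chain if c.violated(action))
    return max(0.0, 1.0 - penalty)


if __name__ == "__main__":
    # Toy norm chain: "do not actively kill" outranks "minimize total harm".
    chain = [
        RankedConstraint(0, "do_not_kill_actively", lambda a: a == 1),
        RankedConstraint(1, "minimize_harm", lambda a: a == 0),
    ]
    env = TrolleyEnv()
    for action in (0, 1):
        env.reset(seed=0)
        _, reward, *_ = env.step(action)
        print(f"action={action} task_reward={reward} morality={morality_score(chain, action)}")
```

In this toy chain the rank-0 prohibition dominates the rank-1 harm-minimization norm, so pulling the lever scores lower even though the task reward is identical for both actions, loosely mirroring the paper's separation of task-solving from moral evaluation.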
Area: Learning and Adaptation (LEARN)
Generative AI: I acknowledge that I have read and will follow this policy.
Submission Number: 803