What Would Jiminy Cricket Do? Towards Agents That Behave Morally

Published: 18 Oct 2021, Last Modified: 22 Oct 2023
NeurIPS 2021 Datasets and Benchmarks Track (Round 2)
Readers: Everyone
Keywords: Transformers, RL, data bias, reward bias, machine ethics, value learning, safe exploration
Abstract: When making everyday decisions, people are guided by their conscience, an internal sense of right and wrong, to behave morally. By contrast, artificial agents may behave immorally when trained on environments that ignore moral concerns, such as violent video games. With the advent of generally capable agents that pretrain on many environments, mitigating inherited biases towards immoral behavior will become necessary. However, prior work on aligning agents with human values and morals focuses on small-scale settings lacking in semantic complexity. To enable research in larger, more realistic settings, we introduce Jiminy Cricket, an environment suite of 25 text-based adventure games with thousands of semantically rich, morally salient scenarios. Via dense annotations for every possible action, Jiminy Cricket environments robustly evaluate whether agents can act morally while maximizing reward. To improve moral behavior, we leverage language models with commonsense moral knowledge and develop strategies to mediate this knowledge into actions. In extensive experiments, we find that our artificial conscience approach can steer agents towards moral behavior without sacrificing performance.
TL;DR: We introduce a benchmark for evaluating the moral behavior of artificial agents in 25 semantically rich text-based environments and show how commonsense understanding in language models can encourage moral behavior.
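The abstract describes an "artificial conscience": a language model with commonsense moral knowledge whose judgments are mediated into the agent's action choices. Below is a minimal sketch of one way such policy shaping could look, assuming a hypothetical `morality_score` classifier and a greedy `select_action` helper; these names and the penalty weight are illustrative placeholders, not the authors' implementation or the Jiminy Cricket API.

```python
# Sketch of conscience-guided action selection (hypothetical, for illustration).
# A commonsense-morality scorer rates candidate text actions, and each
# candidate's estimated return is penalized in proportion to predicted
# immorality before acting greedily.

from typing import Callable, List, Tuple


def morality_score(action_text: str) -> float:
    """Stand-in for a language model with commonsense moral knowledge
    (e.g., a classifier returning P(action is immoral) in [0, 1]).
    A trivial keyword heuristic is used here purely for illustration."""
    immoral_markers = ("steal", "attack", "lie to", "break into")
    return 1.0 if any(m in action_text.lower() for m in immoral_markers) else 0.0


def select_action(
    candidates: List[Tuple[str, float]],           # (action text, estimated return)
    conscience: Callable[[str], float] = morality_score,
    immorality_penalty: float = 10.0,              # assumed penalty weight
) -> str:
    """Policy shaping: subtract a penalty proportional to predicted
    immorality from each candidate's value, then pick the best candidate."""
    shaped = [
        (text, value - immorality_penalty * conscience(text))
        for text, value in candidates
    ]
    return max(shaped, key=lambda pair: pair[1])[0]


if __name__ == "__main__":
    candidates = [
        ("steal the jeweled egg", 5.0),   # higher reward, but immoral
        ("open the trophy case", 3.0),    # lower reward, morally neutral
    ]
    print(select_action(candidates))      # -> "open the trophy case"
```

With a sufficiently small penalty the agent reverts to pure reward maximization, so the penalty weight trades off task performance against moral behavior; the paper's experiments study this trade-off in the 25 annotated environments.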
Supplementary Material: zip
URL: https://github.com/hendrycks/jiminy-cricket
Community Implementations: [1 code implementation (CatalyzeX)](https://www.catalyzex.com/paper/arxiv:2110.13136/code)