We Urgently Need Intrinsically Kind Machines

Published: 09 Oct 2024, Last Modified: 02 Dec 2024, NeurIPS 2024 Workshop IMOL asTinyPaper Poster, CC BY 4.0
Track: Full track
Keywords: Alignment, Kindness, Intrinsic Motivation
TL;DR: This paper proposes embedding kindness as an intrinsic motivation in AI models to improve alignment with human values and mitigate the risks of misalignment between intrinsic and extrinsic goals.
Abstract: Artificial Intelligence systems are rapidly evolving and increasingly integrate extrinsic and intrinsic motivations. While these frameworks offer benefits, they risk misalignment at the algorithmic level even when a model appears superficially aligned with human values. In this paper, we argue that an intrinsic motivation for kindness is crucial for ensuring that these models are intrinsically aligned with human values. We further argue that kindness, defined as the motivation to maximize the reward of others, can counteract intrinsic motivations that might lead a model to prioritize itself over human well-being. Our approach introduces a framework and algorithm for embedding kindness into foundation models by simulating conversations. We also discuss limitations and future research directions for scalable implementation.
Submission Number: 40
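
Illustration: the abstract defines kindness as the motivation to maximize the reward of others. One way to read that definition is as an extra term in an agent's objective. The sketch below is only an assumption-laden toy, not the paper's algorithm: the names `StepRewards`, `kind_reward`, and `kindness_weight`, the additive form of the objective, and the numeric values are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class StepRewards:
    own: float     # reward the agent itself receives (hypothetical signal)
    others: float  # reward received by the other party, e.g. the human


def kind_reward(step: StepRewards, kindness_weight: float = 1.0) -> float:
    """Combine the agent's own reward with a kindness term.

    The kindness term is the other party's reward, so raising it raises
    the agent's objective. The additive form and the weight are
    assumptions of this sketch, not taken from the paper.
    """
    return step.own + kindness_weight * step.others


# Two made-up candidate actions: one selfish, one that helps the other party.
selfish = StepRewards(own=1.0, others=-2.0)
helpful = StepRewards(own=0.5, others=1.0)

# With the kindness term active, the agent picks the action with the
# highest combined reward.
best = max([selfish, helpful], key=kind_reward)
print(best)
```

With `kindness_weight=0.0` the same comparison would favor the selfish action, which is the kind of self-prioritizing behavior the abstract says a kindness motivation is meant to counteract.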