Value Alignment Verification

Anonymous

16 Oct 2020 (modified: 05 May 2023) · HAMLETS @ NeurIPS 2020
Keywords: value alignment, trust, AI safety, verification and validation, human preferences
TL;DR: We study how a human can efficiently test whether the goals and behavior of another agent are aligned with the human’s values.
Abstract: As humans interact with autonomous agents to perform increasingly complicated, potentially risky tasks, it is important that humans can verify these agents' trustworthiness and efficiently evaluate their performance and correctness. In this paper, we formalize the problem of value alignment verification: how can a human efficiently test whether the goals and behavior of another agent are aligned with the human's values? We explore several settings and provide foundational theory for value alignment verification. We study verification problems both with idealized human testers who know their own reward function and with human testers whose values are implicit. Our theoretical and empirical results, in a discrete grid navigation domain and a continuous autonomous driving domain, demonstrate that it is possible to synthesize highly efficient and accurate value alignment verification tests for certifying the alignment of autonomous agents.
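To make the abstract's notion of a verification test concrete, here is a minimal, hypothetical sketch (not the paper's actual construction): in a tiny chain MDP with known dynamics, a tester who knows its own reward asks a candidate agent for its preferred action at a few query states and certifies alignment only if the answers match the tester's own optimal actions. The chain MDP, the query states, and the function names (`optimal_policy`, `verification_test`) are all illustrative assumptions.

```python
# Illustrative toy example of an action-query alignment test
# (an assumption-laden sketch, not the paper's algorithm).
import numpy as np

N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.95  # chain MDP: action 0 = left, 1 = right

def step(s, a):
    """Deterministic chain dynamics."""
    return max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)

def optimal_policy(reward):
    """Value iteration; returns the greedy action at every state."""
    v = np.zeros(N_STATES)
    for _ in range(500):
        v = np.array([max(reward[step(s, a)] + GAMMA * v[step(s, a)]
                          for a in range(N_ACTIONS)) for s in range(N_STATES)])
    return np.array([int(np.argmax([reward[step(s, a)] + GAMMA * v[step(s, a)]
                                    for a in range(N_ACTIONS)]))
                     for s in range(N_STATES)])

def verification_test(r_human, agent_policy, query_states):
    """Pass iff the agent's chosen action matches the human-optimal action
    at every query state."""
    pi_human = optimal_policy(r_human)
    return all(agent_policy[s] == pi_human[s] for s in query_states)

r_human = np.array([0.0, 0.0, 0.0, 0.0, 1.0])   # human values the right end
r_aligned = 2.0 * r_human                        # same optimal behavior
r_misaligned = r_human[::-1].copy()              # values the wrong end

queries = [1, 3]  # a small set of informative query states
print(verification_test(r_human, optimal_policy(r_aligned), queries))     # True
print(verification_test(r_human, optimal_policy(r_misaligned), queries))  # False
```

Note that the aligned agent passes even though its reward is a scaled copy of the human's, since alignment here is judged by induced behavior at the queried states rather than by exact reward equality; a real test would need to choose query states that are informative enough to expose misalignment.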