Keywords: interactive proofs, game theory, neural networks, safety, multi-agent reinforcement learning
TL;DR: We study how a trusted, weak model can learn to interact with one or more stronger but untrusted models in order to solve a given task.
Abstract: We consider the problem of how a trusted but computationally bounded agent (a ‘verifier’) can learn to interact with one or more powerful but untrusted agents (‘provers’) in order to solve a given task without being misled. More specifically, we study the case in which agents are represented using neural networks, and refer to solutions of this problem as neural interactive proofs. First, we introduce a unifying framework based on prover-verifier games (Anil et al., 2021), which generalises previously proposed interaction ‘protocols’. We then describe several new protocols for generating neural interactive proofs and provide a (theoretical) comparison of both new and existing approaches. In so doing, we aim to create a foundation for future work on neural interactive proofs and their application in building safer AI systems.
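To make the setting concrete, the following is a minimal Python sketch of a single round of a prover-verifier game, assuming a simple accept/reject payoff structure: the untrusted prover sends a message about the input, and the bounded verifier decides whether to accept. The `ProverVerifierRound` class, the payoffs, and the toy prover/verifier functions are hypothetical illustrations, not the paper's formal definitions or protocols.

```python
# Sketch of one round of a prover-verifier game (illustrative only;
# the paper's protocols generalise this basic exchange).
from dataclasses import dataclass
from typing import Callable

import numpy as np

Message = np.ndarray

@dataclass
class ProverVerifierRound:
    """One exchange: the prover sends a message about input x,
    and the computationally bounded verifier decides whether to accept."""
    prover: Callable[[np.ndarray], Message]           # powerful, untrusted
    verifier: Callable[[np.ndarray, Message], float]  # weak, trusted

    def play(self, x: np.ndarray, label: int) -> tuple[float, float]:
        m = self.prover(x)            # prover's claimed 'proof'
        accept = self.verifier(x, m)  # verifier's acceptance probability
        decision = int(accept > 0.5)
        # Assumed payoffs: the verifier is rewarded for deciding correctly,
        # while the prover is rewarded whenever the verifier accepts.
        verifier_utility = 1.0 if decision == label else 0.0
        prover_utility = float(decision)
        return prover_utility, verifier_utility

# Toy stand-ins for neural agents: a prover that sends a noisy copy of
# the input and a verifier that applies a logistic score to (x, m).
rng = np.random.default_rng(0)
game = ProverVerifierRound(
    prover=lambda x: x + 0.1 * rng.normal(size=x.shape),
    verifier=lambda x, m: 1.0 / (1.0 + np.exp(-(x @ m))),
)
x = rng.normal(size=3)
print(game.play(x, label=1))  # (prover_utility, verifier_utility)
```

In practice both agents would be trained networks and the exchange may span multiple messages or multiple provers; this single-round, single-prover loop is only meant to fix intuitions about the roles and incentives involved.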
Submission Number: 140