Keywords: Decision Theory, AI Alignment, AI Safety, Agent Foundations, Reflective Consistency, Tiling Agents, Updateless Decision Theory
TL;DR: Proves a new theorem establishing self-trust for Updateless Decision Theory agents, under conditions relating to self-communication.
Abstract: Yudkowsky suggested the criterion of reflective consistency for decision theories (roughly: does a decision theory choose itself?) [Yud10]. Dai proposed Updateless Decision Theory (UDT) as a response to Yudkowsky’s ideas [Dai09]. [DHR25] offered the first published proofs of reflective consistency for UDT. However, those results were not entirely satisfying, due to their reliance on strong assumptions. The current work offers a new attempt based on a formalism inspired by Critch’s notion of agent boundaries [Cri22] as well as Garrabrant’s work on Cartesian Frames [GHLW21] and Finite Factored Sets [Gar21]. The approach here uses communication between agent-moments as a “release valve” for pressures which could otherwise lead to self-modification.
Reviewer Suggestions: Andrew Critch, HealthcareAgents, critch@healthcareagents.com
Daniel A. Herrmann, UNC Chapel Hill, daherrma@uci.edu
Steve Petersen, Niagara University, spetey@gmail.com
Paul Christiano, US AI Safety Institute, paulfchristiano@gmail.com
Lukas Finnveden, Open Philanthropy, finnveden.lukas@gmail.com
Caspar Oesterheld, Carnegie Mellon, caspar.oesterheld@googlemail.com
Ben Levinstein, University of Illinois at Urbana-Champaign, balevinstein@gmail.com
Serve As Reviewer: ~Abram_Demski1
Confirmation: I confirm that I and my co-authors have read the policies and are releasing our work under a CC-BY 4.0 license.
Submission Number: 27