Communication & Trust

09 Jul 2025 (modified: 20 Sept 2025) · ODYSSEY 2025 Conference Submission · CC BY 4.0
Keywords: Decision Theory, AI Alignment, AI Safety, Agent Foundations, Reflective Consistency, Tiling Agents, Updateless Decision Theory
TL;DR: Proves a new theorem establishing self-trust for Updateless Decision Theory agents, under conditions relating to self-communication.
Abstract: Yudkowsky suggested the criterion of reflective consistency for decision theories (roughly: does a decision theory choose itself?) [Yud10]. Dai proposed Updateless Decision Theory (UDT) as a response to Yudkowsky’s ideas [Dai09]. [DHR25] offered the first published proofs of reflective consistency for UDT. However, those results were not entirely satisfying, due to their reliance on strong assumptions. The current work offers a new attempt based on a formalism inspired by Critch’s notion of agent boundaries [Cri22] as well as Garrabrant’s work on Cartesian Frames [GHLW21] and Finite Factored Sets [Gar21]. The approach here uses communication between agent-moments as a “release valve” for pressures which could otherwise lead to self-modification.
Reviewer Suggestions: Andrew Critch, HealthcareAgents, critch@healthcareagents.com; Daniel A. Hermann, UNC Chapel Hill, daherrma@uci.edu; Steve Petersen, Niagara University, spetey@gmail.com; Paul Christiano, US AI Safety Institute, paulfchristiano@gmail.com; Lukas Finnveden, Open Philanthropy, finnveden.lukas@gmail.com; Caspar Oesterheld, Carnegie Mellon, caspar.oesterheld@googlemail.com; Ben Levinstein, University of Illinois at Urbana-Champaign, balevinstein@gmail.com
Serve As Reviewer: ~Abram_Demski1
Confirmation: I confirm that I and my co-authors have read the policies and are releasing our work under a CC-BY 4.0 license.
Submission Number: 27