Towards shutdownable agents via stochastic choice

TMLR Paper5661 Authors

17 Aug 2025 (modified: 27 Aug 2025) | Under review for TMLR | CC BY 4.0
Abstract: The POST-Agents Proposal (PAP) is an idea for ensuring that advanced artificial agents never resist shutdown. A key part of the PAP is using a novel 'Discounted Reward for Same-Length Trajectories (DReST)' reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be 'USEFUL'), and (2) choose stochastically between different trajectory-lengths (be 'NEUTRAL' about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus provide some initial evidence that DReST reward functions could train advanced agents to be USEFUL and NEUTRAL. Our theoretical work suggests that these agents would be useful and shutdownable.
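To make the idea of choosing stochastically between trajectory-lengths concrete, the sketch below is purely illustrative. The abstract does not define the NEUTRALITY metric or the DReST formula, so both functions are assumptions: `neutrality` scores a record of chosen trajectory-lengths by normalized Shannon entropy, and `same_length_discount` shows one hypothetical shape a discount for repeatedly choosing the same trajectory-length could take. The function names, the parameter `lam`, and the exponent are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def neutrality(chosen_lengths, num_options):
    """Normalized Shannon entropy of the chosen trajectory-lengths.

    Assumed metric (not from the abstract): 1.0 means every
    trajectory-length was chosen equally often, 0.0 means the same
    trajectory-length was always chosen.
    """
    if not chosen_lengths or num_options < 2:
        raise ValueError("need at least one choice and two options")
    total = len(chosen_lengths)
    counts = Counter(chosen_lengths)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(num_options)


def same_length_discount(times_chosen, episodes_so_far, num_options, lam=0.9):
    """Hypothetical discount applied to reward for a chosen trajectory-length.

    The factor shrinks below 1 when this trajectory-length has been chosen
    more often than an even split would predict, and grows above 1 when it
    has been chosen less often.
    """
    return lam ** (times_chosen - episodes_so_far / num_options)
```

Under these assumptions, an agent that alternates evenly between two trajectory-lengths scores a neutrality of 1.0, while an agent that always picks the same trajectory-length scores 0.0 and sees its reward for that length discounted further on each repeat.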
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Dennis_J._N._J._Soemers1
Submission Number: 5661