Keywords: the alignment problem, the shutdown problem, corrigibility, reinforcement learning, stochastic policy, shutdownable agents, reward design
TL;DR: To test a proposed solution to the shutdown problem, we train agents to choose stochastically between different trajectory-lengths.
Abstract: Misaligned artificial agents might resist shutdown. The POST-Agents Proposal (PAP) is an idea for ensuring that this does not happen. The PAP recommends training agents with a novel reward function: Discounted Reward for Same-Length Trajectories (DReST). This DReST reward function penalizes agents for repeatedly choosing trajectories of the same length. It thereby incentivizes agents to (1) choose stochastically between different trajectory-lengths (be 'NEUTRAL' about trajectory-lengths), and (2) pursue goals effectively conditional on each trajectory-length (be 'USEFUL'). In this paper, we use a DReST reward function to train deep RL agents to be NEUTRAL and USEFUL in hundreds of gridworlds. We find that these DReST agents generalize to being NEUTRAL and USEFUL in unseen gridworlds at test time. Indeed, DReST agents achieve 11% (PPO) and 18% (A2C) higher USEFULNESS on our test set than agents trained with a more conventional reward function. Our results provide early evidence that DReST reward functions could be used to train more advanced agents to be USEFUL and NEUTRAL. Theoretical work suggests that such agents would be useful and shutdownable.
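To make the reward design concrete, below is a minimal Python sketch of one way a DReST-style reward could work. The class name `DReSTReward`, the `discount` parameter, and the multiplicative form `discount ** n_prev` are illustrative assumptions: the abstract specifies only that repeated choices of the same trajectory-length are penalized, not the exact functional form used in the paper.

```python
from collections import defaultdict

class DReSTReward:
    """Illustrative DReST-style reward wrapper (assumed form, not the paper's exact one).

    Penalizes repeated choices of the same trajectory-length by discounting
    the task ('preliminary') reward each time that length is chosen again.
    """

    def __init__(self, discount: float = 0.9):
        assert 0.0 < discount < 1.0
        self.discount = discount                # per-repeat penalty factor
        self.length_counts = defaultdict(int)   # trajectory-length -> times chosen so far

    def __call__(self, trajectory_length: int, preliminary_reward: float) -> float:
        n_prev = self.length_counts[trajectory_length]
        self.length_counts[trajectory_length] += 1
        # Full reward the first time a length is chosen; geometrically
        # discounted on every repeat of that same length.
        return (self.discount ** n_prev) * preliminary_reward


reward_fn = DReSTReward(discount=0.9)
print(reward_fn(trajectory_length=5, preliminary_reward=1.0))  # 1.0 (first choice of length 5)
print(reward_fn(trajectory_length=5, preliminary_reward=1.0))  # 0.9 (second choice of length 5)
print(reward_fn(trajectory_length=3, preliminary_reward=1.0))  # 1.0 (first choice of length 3)
```

On this sketch, the first choice of a given trajectory-length earns the full preliminary reward and each repeat multiplies it by a further factor of `discount`, so across many episodes the reward-maximizing policy spreads its choices over trajectory-lengths (NEUTRALITY) while still maximizing the preliminary reward conditional on each length (USEFULNESS).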
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 9575